A parallel array processor for massively parallel applications is formed with low power CMOS with DRAM
processing while incorporating processing elements on a single chip. Eight processors on a single chip have
their own associated processing element, significant memory, and I/O and are interconnected with a hypercube
based, but modified, topology. These nodes are then interconnected, either by a hypercube, modified
hypercube, or ring, or ring within ring network topology. Conventional microprocessor MMPs consume pins and
time going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16 bit
processors with 32K and I/O) in DRAM and has no memory access delays and uses all the pins for networking.
The chip can be a single node of a fine-grained parallel processor. Each chip will have eight 16 bit processors,
each processor providing 5 MIPS performance. I/O has three internal ports and one external port shared by the
plural processors on the chip. Significant software flexibility is provided to enable quick implementation of
existing programs written in common languages. It is a developable and expandable technology without need to
develop new pinouts, new software, or new utilities as chip density increases and new hardware is provided for a
chip function. The scalable chip PME has internal and external connections for broadcast and asynchronous
SIMD, MIMD and SIMIMD (SIMD/MIMD) with dynamic switching of modes. The chip can be used in systems
which employ 32, 64 or 128,000 processors and can be used for lower, intermediate and higher ranges. Local
and global memory functions can all be provided by the chips themselves, and the system can connect to and
support other global memories and DASD. The chip can be used as a microprocessor accelerator, in personal
computer applications, as a vision or avionics computer system, or as workstation or supercomputer. There is
program compatibility for the fully scalable system.
FIELD OF THE INVENTIONS

The invention relates to computers and computer systems and particularly to parallel array processors.
In accordance with the invention, a parallel array processor (APAP) may be incorporated on a single
semiconductor silicon chip. This chip forms a basis for the systems described which are capable of
massively parallel processing of complex scientific and business applications.

REFERENCES USED IN THE DISCUSSION OF THE INVENTIONS

In the detailed discussion of the invention, other works will be referenced, including references to our
own unpublished works which are not Prior Art, which will aid the reader in following the discussion.

GLOSSARY OF TERMS 

o ALU

ALU is the arithmetic logic unit portion of a processor.
o Array

Array refers to an arrangement of elements in one or more dimensions. An array can include an
ordered set of data items (array element) which in languages like Fortran are identified by a single
name. In other languages such a name of an ordered set of data items refers to an ordered collection
or set of data elements, all of which have identical attributes. A program array has dimensions
specified, generally by a number or dimension attribute. The declarator of the array may also specify
the size of each dimension of the array in some languages. In some languages, an array is an
arrangement of elements in a table. In a hardware sense, an array is a collection of structures
(functional elements) which are generally identical in a massively parallel architecture. Array elements
in data parallel computing are elements which can be assigned operations and when parallel can each
independently and in parallel execute the operations required. Generally, arrays may be thought of as
grids of processing elements. Sections of the array may be assigned sectional data, so that sectional
data can be moved around in a regular grid pattern. However, data can be indexed or assigned to an
arbitrary location in an array.

o Array Director 

An Array Director is a unit programmed as a controller for an array. It performs the function of a
master controller for a grouping of functional elements arranged in an array.
o Array Processor 

There are two principal types of array processors - multiple instruction multiple data (MIMD) and single
instruction multiple data (SIMD). In a MIMD array processor, each processing element in the array
executes its own unique instruction stream with its own data. In a SIMD array processor, each
processing element in the array is restricted to the same instruction via a common instruction stream;
however, the data associated with each processing element is unique. Our preferred array processor
has other characteristics. We call it Advanced Parallel Array Processor, and use the acronym APAP.
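
The distinction between the two types can be illustrated with a minimal Python sketch. It is not part of the disclosure; the names run_simd, run_mimd, double and increment are hypothetical, and each processing element is modeled simply as one integer of local data.

```python
from typing import Callable, List

Instruction = Callable[[int], int]


def run_simd(instruction_stream: List[Instruction], data: List[int]) -> List[int]:
    """SIMD: one common instruction stream; every element applies it to its own datum."""
    state = list(data)
    for instr in instruction_stream:
        state = [instr(x) for x in state]  # same instruction, unique data per element
    return state


def run_mimd(programs: List[List[Instruction]], data: List[int]) -> List[int]:
    """MIMD: each element executes its own instruction stream on its own datum."""
    results = []
    for program, x in zip(programs, data):
        for instr in program:
            x = instr(x)
        results.append(x)
    return results


def double(x: int) -> int:
    return 2 * x


def increment(x: int) -> int:
    return x + 1


simd_out = run_simd([double, increment], [1, 2, 3])
mimd_out = run_mimd([[double], [increment], [double, increment]], [1, 2, 3])
```

In the SIMD call all three elements double and then increment their data; in the MIMD call each element follows a different program.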
o Asynchronous 

Asynchronous is without a regular time relationship; the execution of a function is unpredictable with
respect to the execution of other functions which occur without a regular or predictable time
relationship to other function executions. In control situations, a controller will address a location to
which control is passed when data is waiting for an idle element being addressed. This permits
operations to remain in a sequence while they are out of time coincidence with any event.
o BOPS/GOPS

BOPS or GOPS are acronyms having the same meaning - billions of operations per second. See
GOPS.

o Circuit Switched/Store Forward

These terms refer to two mechanisms for moving data packets through a network of nodes. Store
Forward is a mechanism whereby a data packet is received by each intermediate node, stored into its
memory, and then forwarded on towards its destination. Circuit Switch is a mechanism whereby an
intermediate node is commanded to logically connect its input port to an output port such that data
packets can pass directly through the node towards their destination, without entering the intermediate
node's memory.
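
The difference between the two mechanisms can be sketched as follows. This is an illustrative Python model only, assuming a fixed path of nodes; the Node class and the functions store_forward and circuit_switch are hypothetical names, not part of the described hardware.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    packets_buffered: int = 0  # packets that entered this node's memory
    delivered: List[str] = field(default_factory=list)


def store_forward(path: List[Node], packet: str) -> None:
    """Each intermediate node receives the packet into memory, then forwards it."""
    for node in path[1:-1]:
        node.packets_buffered += 1
    path[-1].delivered.append(packet)


def circuit_switch(path: List[Node], packet: str) -> None:
    """Intermediate nodes only connect an input port to an output port;
    the packet passes through without entering their memory."""
    path[-1].delivered.append(packet)


sf_path = [Node("A"), Node("B"), Node("C"), Node("D")]
store_forward(sf_path, "pkt-1")

cs_path = [Node("A"), Node("B"), Node("C"), Node("D")]
circuit_switch(cs_path, "pkt-1")
```

After both calls the destination holds the packet, but only the store-forward path shows the intermediate nodes B and C having buffered it.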
o Cluster 

A cluster is a station (or functional unit) which consists of a control unit (cluster controller) and the
hardware (which may be terminals, functional units, or virtual components) attached to it. Our Cluster
includes an array of PMEs sometimes called a Node array. Usually a cluster has 512 PMEs.

Our entire PME node array consists of a set of clusters, each cluster supported by a cluster
controller (CC).

o Cluster controller

A cluster controller is a device that controls input/output (I/O) operations for more than one device or
functional unit connected to it. A cluster controller is usually controlled by a program stored and
executed in the unit as it was in the IBM 3601 Finance Communication Controller, but it can be
entirely controlled by hardware as it was in the IBM 3272 Control Unit.

o Cluster synchronizer

A cluster synchronizer is a functional unit which manages the operations of all or part of a cluster to
maintain synchronous operation of the elements so that the functional units maintain a particular time
relationship with the execution of a program.
o Controller

A controller is a device that directs the transmission of data and instructions over the links of an
interconnection network; its operation is controlled by a program executed by a processor to which
the controller is connected or by a program executed within the device.
o CMOS 

CMOS is an acronym for Complementary Metal-Oxide Semiconductor technology. It is commonly
used to manufacture dynamic random access memories (DRAMs). NMOS is another technology used
to manufacture DRAMs. We prefer CMOS but the technology used to manufacture the APAP is not
intended to limit the scope of the semiconductor technology which is employed.
o Dotting 

Dotting refers to the joining of three or more leads by physically connecting them together. Most
backpanel busses share this connection approach. The term relates to OR DOTS of times past but is
used here to identify multiple data sources that can be combined onto a bus by a very simple
protocol.

Our I/O zipper concept can be used to implement the concept that the port into a node could be
driven by the port out of a node or by data coming from the system bus. Conversely, data being put
out of a node would be available to both the input to another node and to the system bus. Note that
outputting data to both the system bus and another node is not done simultaneously but in different
cycles.

Dotting is used in the H-DOT discussions where Two-ported PEs or PMEs or Pickets can be used
in arrays of various organizations by taking advantage of dotting. Several topologies are discussed
including 2D and 3D Meshes, Base 2 N-cube, Sparse Base 4 N-cube, and Sparse Base 8 N-cube.

o DRAM

DRAM is an acronym for dynamic random access memory, the common storage used by computers
for main memory. However, the term DRAM can be applied to use as a cache or as a memory which
is not the main memory.

o FLOATING-POINT

A floating-point number is expressed in two parts. There is a fixed point or fraction part, and an
exponent part to some assumed radix or Base. The exponent indicates the actual placement of the
decimal point. In the typical floating-point representation a real number 0.0001234 is represented as
0.1234-3, where 0.1234 is the fixed-point part and -3 is the exponent. In this example, the floating-point
radix or base is 10, where 10 represents the implicit fixed positive integer base, greater than
unity, that is raised to the power explicitly denoted by the exponent in the floating-point representation
or represented by the characteristic in the floating-point representation and then multiplied by the
fixed-point part to determine the real number represented. Numeric literals can be expressed in
floating-point notation as well as real numbers.
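
As a worked illustration of this definition, the following Python sketch splits a base-10 value into a fraction in the range [0.1, 1) and an exponent, and then rebuilds the value as fraction times base raised to the exponent. The function names split_base10 and rebuild are hypothetical, and the decimal module is used only to keep the arithmetic exact.

```python
from decimal import Decimal
from typing import Tuple


def split_base10(value: str) -> Tuple[Decimal, int]:
    """Return (fraction, exponent) with value == fraction * 10**exponent
    and 0.1 <= |fraction| < 1 for non-zero values."""
    d = Decimal(value)
    if d == 0:
        return Decimal(0), 0
    exponent = d.adjusted() + 1       # position of the leading digit, shifted to 0.xxxx form
    fraction = d.scaleb(-exponent)    # move the decimal point without rounding
    return fraction, exponent


def rebuild(fraction: Decimal, exponent: int, base: int = 10) -> Decimal:
    """Multiply the fixed-point part by the base raised to the exponent."""
    return fraction * Decimal(base) ** exponent


fraction, exponent = split_base10("0.0001234")
value = rebuild(fraction, exponent)
```

For the value 0.0001234 this yields the fixed-point part 0.1234 and the exponent -3.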

o FLOPS

This term refers to floating-point instructions per second. Floating-point operations include ADD,
SUB, MPY, DIV and often many others. The floating-point instructions per second parameter is often
calculated using the add or multiply instructions and, in general, may be considered to have a 50/50
mix. An operation includes the generation of exponent, fraction and any required fraction normalization.
We could address 32 or 48-bit floating-point formats (or longer but we have not counted them in
the mix.) A floating-point operation when implemented with fixed point instructions (normal or RISC)
requires multiple instructions. Some use a 10 to 1 ratio in figuring performance while some specific
studies have shown a ratio of 8.25 more appropriate to use. Various architectures will have different
ratios.
o Functional unit

A functional unit is an entity of hardware, software, or both, capable of accomplishing a purpose.
o Gbytes

Gbytes refers to a billion bytes. Gbytes/s would be a billion bytes per second.
o GIGAFLOPS

(10)**9 floating-point instructions per second.
o GOPS and PETAOPS

GOPS or BOPS have the same meaning - billions of operations per second. PETAOPS means
trillions of operations per second, a potential of the current machine. For our APAP machine they are
just about the same as BIPs/GIPs meaning billions of instructions per second. In some machines an
instruction can cause two or more operations (i.e. both an add and multiply) but we don't do that.
Alternatively it could take many instructions to do an op. For example we use multiple instructions to
perform 64 bit arithmetic. In counting ops however, we did not elect to count log ops. GOPS may be
the preferred use to describe performance, but there is no consistency in usage that has been noted.
One sees MIPs/MOPs then BIPs/BOPs and MegaFLOPS/GigaFLOPS/TeraFLOPS/PetaFLOPS.
o ISA 

ISA means the Instruction Set Architecture.
o Link

A link is an element which may be physical or logical. A physical link is the physical connection for
joining elements or units, while in computer programming a link is an instruction or address that
passes control and parameters between separate portions of the program. In multisystems a link is
the connection between two systems which may be specified by program code identifying the link
which may be identified by a real or virtual address. Thus generally a link includes the physical
medium, any protocol, and associated devices and programming; it is both logical and physical.

o MFLOPS

MFLOPS means (10)**6 floating-point instructions per second.
o MIMD

MIMD is used to refer to a processor array architecture wherein each processor in the array has its
own instruction stream, thus Multiple Instruction stream, to execute Multiple Data streams located one
per processing element.
o Module 

A module is a program unit that is discrete and identifiable or a functional unit of hardware designed
for use with other components. Also, a collection of PEs contained in a single electronic chip is called
a module.
o Node

Generally, a node is the junction of links. In a generic array of PEs, one PE can be a node. A node
can also contain a collection of PEs called a module. In accordance with our invention a node is
formed of an array of PMEs, and we refer to the set of PMEs as a node. Preferably a node is 8 PMEs.
o Node array

A collection of modules made up of PMEs is sometimes referred to as a node array, that is, an array of
nodes made up of modules. A node array is usually more than a few PMEs, but the term
encompasses a plurality.
o PDE

A PDE is a partial differential equation.

o PDE relaxation solution process 

PDE relaxation solution process is a way to solve a PDE (partial differential equation). Solving PDEs
uses most of the super computing compute power in the known universe and can therefore be a good
example of the relaxation process. There are many ways to solve the PDE equation and more than
one of the numerical methods includes the relaxation process. For example, if a PDE is solved by
finite element methods, relaxation consumes the bulk of the computing time. Consider an example
from the world of heat transfer. Given hot gas inside a chimney and a cold wind outside, how will the
temperature gradient within the chimney bricks develop? By considering the bricks as tiny segments
and writing an equation that says how heat flows between segments as a function of temperature
differences, the heat transfer PDE has been converted into a finite element problem. If we then
say all elements except those on the inside and outside are at room temperature while the boundary
segments are at the hot gas and cold wind temperature, we have set up the problem to begin
relaxation. The computer program then models time by updating the temperature variable in each
segment based upon the amount of heat that flows into or out of the segment. It takes many cycles of
processing all the segments in the model before the set of temperature variables across the chimney
relaxes to represent the actual temperature distribution that would occur in the physical chimney. If the
objective was to model gas cooling in the chimney then the elements would have to extend to gas
equations, and the boundary conditions on the inside would be linked to another finite element model,
and the process continues. Note that the heat flow is dependent upon the temperature difference
between the segment and its neighbors. It thus uses the inter-PE communication paths to distribute
the temperature variables. It is this near neighbor communication pattern or characteristic that makes
PDE relaxation very applicable to parallel computing.
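
The chimney example can be sketched in a few lines of Python. This is a simplified one-dimensional illustration under our own assumptions, not the method of the disclosure: the wall is a single row of segments, the function name relax_wall and the parameters alpha, tol and max_sweeps are hypothetical, and each sweep uses only the values of a segment's two neighbors, which is the near neighbor pattern noted above.

```python
from typing import List, Tuple


def relax_wall(t_hot: float, t_cold: float, t_room: float, segments: int,
               alpha: float = 0.25, tol: float = 1e-6,
               max_sweeps: int = 100_000) -> Tuple[List[float], int]:
    """Relax a row of wall segments toward its steady temperature profile.

    The two boundary segments are held at the hot gas and cold wind temperatures;
    all interior segments start at room temperature.
    """
    temps = [t_hot] + [t_room] * (segments - 2) + [t_cold]
    for sweep in range(1, max_sweeps + 1):
        new = temps[:]
        for i in range(1, segments - 1):
            # Net heat flow depends only on differences with the two neighbors.
            flow = (temps[i - 1] - temps[i]) + (temps[i + 1] - temps[i])
            new[i] = temps[i] + alpha * flow
        change = max(abs(a - b) for a, b in zip(new, temps))
        temps = new
        if change < tol:  # the profile has stopped changing: relaxed
            return temps, sweep
    return temps, max_sweeps


profile, sweeps = relax_wall(t_hot=200.0, t_cold=0.0, t_room=20.0, segments=5)
```

Because every interior update reads only the previous values of adjacent segments, all segments can in principle be updated at the same time, one per processing element. For this one-dimensional case the relaxed profile approaches a straight line between the two boundary temperatures.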

o PICKET

This is the element in an array of elements making up an array processor. It consists of: data flow
(ALU REGS), memory, control, and the portion of the communication matrix associated with the
element. The unit refers to a 1/nth of an array processor made up of parallel processor and memory
elements with their control and portion of the array intercommunication mechanism. A picket is a form
of processor memory element or PME. Our PME chip design processor logic can implement the
picket logic described in related applications or have the logic for the array of processors formed as a
node. The term PICKET is similar to the commonly used array term PE for processing element and
is an element of the processing array preferably comprised of a combined processing element and
local memory for processing bit parallel bytes of information in a clock cycle. The preferred
embodiment consists of a byte wide data flow processor, 32k bytes or more of memory, primitive
controls and ties to communications with other pickets.

The term "picket" comes from Tom Sawyer and his white fence, although it will also be
understood functionally that a military picket line analogy fits quite well.
o Picket Chip 

A picket chip contains a plurality of pickets on a single silicon chip.
o Picket Processor system (or Subsystem)

A picket processor is a total system consisting of an array of pickets, a communication network, an
I/O system, and a SIMD controller consisting of a microprocessor, a canned routine processor, and a
micro-controller that runs the array.

o Picket Architecture

The Picket Architecture is the preferred embodiment for the SIMD architecture with features that
accommodate several diverse kinds of problems including:

- set associative processing

- parallel numerically intensive processing

- physical array processing similar to images

o Picket Array

A picket array is a collection of pickets arranged in a geometric order, a regular array.
o PME or processor memory element 

PME is used for a processor memory element. We use the term PME to refer to a single processor,
memory and I/O capable system element or unit that forms one of our parallel array processors. A
processor memory element is a term which encompasses a picket. A processor memory element is
1/nth of a processor array which comprises a processor, its associated memory, control interface, and
a portion of an array communication network mechanism. This element can have a processor memory
element with a connectivity of a regular array, as in a picket processor, or as part of a subarray, as in
the multi-processor memory element node we have described.
o Routing 

Routing is the assignment of a physical path by which a message will reach its destination. Routing
assignments have a source or origin and a destination. These elements or addresses have a
temporary relationship or affinity. Often, message routing is based upon a key which is obtained by
reference to a table of assignments. In a network, a destination is any station or network addressable
unit addressed as the destination of information transmitted by a path control address that identifies
the link. The destination field identifies the destination with a message header destination code.
o SIMD

A processor array architecture wherein all processors in the array are commanded from a Single
Instruction stream to execute Multiple Data streams located one per processing element.

o SIMDMIMD or SIMD/MIMD

SIMDMIMD or SIMD/MIMD is a term referring to a machine that has a dual function that can switch
from MIMD to SIMD for a period of time to handle some complex instruction, and thus has two
modes. The Thinking Machines, Inc. Connection Machine model CM-2 when placed as a front end or
back end of a MIMD machine permitted programmers to operate different modes for execution of
different parts of a problem, referred to sometimes as dual modes. These machines have existed since
Illiac and have employed a bus that interconnects the master CPU with other processors. The master
control processor would have the capability of interrupting the processing of other CPUs. The other
CPUs could run independent program code. During an interruption, some provision must be made for
checkpointing (closing and saving current status of the controlled processors).
o SIMIMD

SIMIMD is a processor array architecture wherein all processors in the array are commanded from a
Single Instruction stream, to execute Multiple Data streams located one per processing element.
Within this construct, data dependent operations within each picket that mimic instruction execution
are controlled by the SIMD instruction stream.

This is a Single Instruction Stream machine with the ability to sequence Multiple Instruction
streams (one per Picket) using the SIMD instruction stream and operate on Multiple Data Streams
(one per Picket). SIMIMD can be executed by a processor memory element system.

o SISD

SISD is an acronym for Single Instruction Single Data.
o Swapping

Swapping interchanges the data content of a storage area with that of another area of storage.

o Synchronous Operation

Synchronous operation in a MIMD machine is a mode of operation in which each action is related to
an event (usually a clock); it can be a specified event that occurs regularly in a program sequence. An
operation is dispatched to a number of PEs who then go off to independently perform the function.
Control is not returned to the controller until the operation is completed. If the request is to an array of
functional units, the request is generated by a controller to elements in the array which must complete
their operation before control is returned to the controller.

o TERAFLOPS

TERAFLOPS means (10)**12 floating-point instructions per second.

o VLSI

VLSI is an acronym for very large scale integration (as applied to integrated circuits).
o Zipper

A zipper is a new function provided. It allows for links to be made from devices which are external to
the normal interconnection of an array configuration.

BACKGROUND OF THE INVENTION 

In the never ending quest for faster computers, engineers are linking hundreds, and even thousands of
low cost microprocessors together in parallel to create super supercomputers that divide in order to conquer
complex problems that stump today's machines. Such machines are called massively parallel. We have
created a new way to create massively parallel systems. The many improvements which we have made
should be considered against the background of many works of others.

Multiple computers operating in parallel have existed for decades. Early parallel machines included the
ILLIAC, which was started in the 1960s. ILLIAC IV was built in the 1970s. Other multiple processors include
(see a partial summary in U.S. Patent 4,975,834 issued December 4, 1990 to Xu et al) the Cedar, Sigma-1,
the Butterfly and the Monarch, the Intel iPSC, The Connection Machines, the Caltech COSMIC, the N Cube,
IBM's RP3, IBM's GF11, the NYU Ultra Computer, the Intel Delta and Touchstone.

Large multiple processors beginning with ILLIAC have been considered supercomputers. Supercomputers
with greatest commercial success have been based upon multiple vector processors, represented by
the Cray Research Y-MP systems, the IBM 3090, and other manufacturers' machines including those of
Amdahl, Hitachi, Fujitsu, and NEC.

Massively Parallel Processors (MPPs) are now thought of as capable of becoming supercomputers.
These computer systems aggregate a large number of microprocessors with an interconnection network
and program them to operate in parallel. There have been two modes of operation of these computers.
Some of these machines have been MIMD mode machines.
Some of these machines have been SIMD mode machines. Perhaps the most commercially acclaimed
of these machines has been the Connection Machines series 1 and 2 of Thinking Machines, Inc. These
have been essentially SIMD machines. Many of the massively parallel machines have used microprocessors
interconnected in parallel to obtain their concurrency or parallel operations capability. Intel microprocessors
like i860 have been used by Intel and others. N Cube has made such machines with Intel '386
microprocessors. Other machines have been built with what is called the "transputer" chip. Inmos
Transputer IMS T800 is an example. The Inmos Transputer T800 is a 32 bit device with an integral high
speed floating point processor.

As an example of the kind of systems that are built, several Inmos Transputer T800 chips each would
have 32 communication link inputs and 32 link outputs. Each chip would have a single processor, a small
amount of memory, and communication links to the local memory and to an external interface. In addition,
in order to build up the system communication link adaptors like IMS C011 and C012 would be connected.
In addition switches, like an IMS C004, would provide, say, a crossbar switch between the 32 link inputs and
32 link outputs to provide point-to-point connection between additional transputer chips. In addition, there
will be special circuitry and interface chips for transputers adapting them to be used for a special purpose
tailored to the requirements of a specific device, a graphics or disk controller. The Inmos IMS M212 is a 16
bit processor, with on chip memory and communication links. It contains hardware and logic to control disk
drives and can be used as a programmable disk controller or as a general purpose interface. In order to use
the concurrency (parallel operations) Inmos developed a special language, Occam, for the transputer.
Programmers have to describe the network of transputers directly in an Occam program.

Some of these massively parallel machines use parallel processor arrays of processor chips which are
interconnected with different topologies. The transputer provides a crossbar network with the addition of IMS
C004 chips. Some other systems use a hypercube connection. Others use a bus or mesh to connect the
microprocessors and their associated circuitry. Some have been interconnected by circuit switch processors
that use switches as processor addressable networks. Generally, as with the 14 RISC/6000s which
were interconnected last fall at Lawrence Livermore by wiring the machines together, the processor
addressable networks have been considered as coarse-grained multiprocessors.

Some very large machines are being built by Intel and nCube and others to attack what are called
"grand challenges" in data processing. However, these computers are very expensive. Recent projected
costs are in the order of $30,000,000.00 to $76,000,000.00 (Tera Computer) for computers whose
development has been funded by the U.S. Government to attack the "grand challenges". These "grand
challenges" would include such problems as climate modeling, fluid turbulence, pollution dispersion,
mapping of the human genome and ocean circulation, quantum chromodynamics, semiconductor and
supercomputer modeling, combustion systems, vision and cognition.

As a footnote to our background, we should recognize one of the early massively parallel machines
developed by IBM. In our description we have chosen to use the term processor memory element rather
than "transputer" to describe one of the eight or more memory units with processor and I/O capabilities
which make up the array of PMEs in a chip, or node. The referenced prior art "transputer" has on a chip
one processor, a Fortran coprocessor, and a small memory, with an I/O interface. Our processor memory
element could apply to a transputer and to the PME of the RP3 generally. However, as will be recognized,
our little chip is significantly different in many respects. Our little chip has many features described later.
However, we do recognize that the term PME was first coined for another, now more typical, PME which
formed the basis for the massively parallel machine known as the RP3. The IBM Research Parallel
Processing Prototype (RP3) was an experimental parallel processor based on a Multiple Instruction Multiple
Data (MIMD) architecture. RP3 was designed and built at IBM T.J. Watson Research Center in cooperation
with the New York University Ultracomputer project. This work was sponsored in part by Defense Advanced
Research Project Agency. RP3 was comprised of 64 Processor-Memory Elements (PMEs) interconnected
by a high speed omega network. Each PME contained a 32-bit IBM "PC scientific" microprocessor, 32-kB
cache, a 4-MB segment of the system memory, and an I/O port. The PME I/O port hardware and software
supported initialization, status acquisition, as well as memory and processor communication through shared
I/O Support Processors (ISPs). Each ISP supports eight processor-memory elements through the Extended
I/O adapters (ETIOs), independent of the system network. Each ISP interfaced to the IBM S/370 channel
and the IBM Token-Ring network as well as providing operator monitor service. Each extended I/O adapter
attached as a device to a PME ROMP Storage Channel (RSC) and provided programmable PME
control/status signal I/O via the ETIO channel. The ETIO channel is the 32-bit bus which interconnected the
ISP to the eight adapters. The ETIO channel relied on a custom interface protocol which was supported by
hardware on the ETIO adapter and software on the ISP.
Problems addressed by our APAP machine

The machine which we have called the Advanced Parallel Array Processor (APAP) is a fine-grained
parallel processor which we believe is needed to address issues of prior designs. As illustrated above, there
have been many fine-grained (and also coarse-grained) processors constructed from both point design and
off-the-shelf processors using dedicated and shared memory and any one of the many possible interconnection
schemes. To date these approaches have all encountered one or more design and performance
limitations. Each "solution" leads in a different direction. Each has its problems. Existing parallel machines
are difficult to program. Each is not generally adaptable to various sizes of machines compatible across a
range of applications. Each has its design limitations caused by physical design, interconnection and
architectural issues.

Physical issues 

Some approaches utilize a separate chip design for each of the various functions required in a
horizontal structure. These approaches suffer performance limitations due to chip crossing delays.

Other approaches integrate various functions together vertically into a single chip. These approaches
suffer performance limitations due to the physical limit on the number of logic gates which can be
integrated onto a producible chip.

Interconnection Issues 

Networks which interconnect the various processing functions are important to fine-grained parallel
processors. Processor designs with buses, meshes, and hypercubes have all been developed. Each of
these networks has inherent limitations as to processing capability. Buses limit both the number of
processors which can be physically interconnected and the network performance. Meshes lead to large
network diameters which limit network performance. Hypercubes require each node to have a large number
of interconnection ports; the number of processors which can be interconnected is limited by the physical
input-output pins at the node. Hypercubes are recognized as having some significant performance gains
over the prior bus and mesh structures.
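
The trade-off between port count and network diameter can be made concrete with a short Python sketch. It assumes a standard binary hypercube and a square two-dimensional mesh without wraparound links; the function names hypercube_neighbors, hypercube_stats and mesh_stats are hypothetical and do not describe the modified topology of our chip.

```python
from typing import Dict, List


def hypercube_neighbors(node: int, dimension: int) -> List[int]:
    """Neighbors of a node in a binary hypercube differ in exactly one address bit."""
    return [node ^ (1 << bit) for bit in range(dimension)]


def hypercube_stats(n_nodes: int) -> Dict[str, int]:
    """Ports per node and diameter (longest shortest path) of a binary hypercube."""
    if n_nodes < 1 or n_nodes & (n_nodes - 1):
        raise ValueError("a binary hypercube needs a power-of-two node count")
    dimension = n_nodes.bit_length() - 1
    return {"ports_per_node": dimension, "diameter": dimension}


def mesh_stats(side: int) -> Dict[str, int]:
    """Ports per interior node and diameter of a side x side mesh without wraparound."""
    return {"ports_per_node": 4, "diameter": 2 * (side - 1)}


cube = hypercube_stats(4096)      # 4096 = 2**12 nodes
mesh = mesh_stats(64)             # 64 x 64 = 4096 nodes
```

For the same 4096 nodes, the hypercube needs 12 ports per node for a diameter of 12 hops, while the mesh needs only 4 ports per node but has a diameter of 126 hops.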

Architectural Issues: 

Processes which are suitable for fine-grained parallel processors fall into two distinct types. Processes
which are functionally partitionable tend to perform better on multiple instruction, multiple data (MIMD)
architectures. Processes which are not functionally partitionable but have multiple data streams tend to
perform better on single instruction, multiple data (SIMD) architectures. For any given application, there is
likely to be some number of both types of processes. System trade-offs are required to pick the
architecture which best suits a particular application but no single solution has been satisfactory.


SUMMARY OF THE INVENTION 

We have created a new way to make massively parallel processors and other computer systems by
creating a new "chip" and systems designed with our new concepts. This application is directed to such
systems. Components described in our applications can be combined in our systems to make new
systems. They also can be combined with existing technology.

Think, our little CMOS DRAM chip of approximately 14 x 14 mm can be put together much like bricks
are walled in a building or paved to form a brick road. Our chip provides the structure necessary to build a
"house", a complex computer system, by connected replication.

Placing our development in perspective, four little chips, each one alike, each one with eight or more
processors embedded in memory with an internal array capability and external I/O broadcast and control
interface, would provide the memory and processing power of thirty-six or more complex computers, and
they could all be placed with compact hybrid packaging into something the size of a watch, and operated
with very low power, as each chip only dissipates about 2 watts. With this chip, we have created many new
concepts, and those that we consider our invention are described in detail in the description and claims.
The systems that can be created with our computer system can range from small devices to massive
machines with PETAOP potential.

Our little memory chip array processor we call our Advanced Parallel Array Processor. Though small, it
is complex and powerful. A typical cluster will have many chips.

Many aspects and features of invention have been described in this and related applications. These
concepts and features of invention improve and are applicable to computer systems which may not employ
each invention. We believe our concepts and features will be adopted and used in the next century.

This technical description provides an overview of our Advanced Parallel Array Processor (APAP)
representing our new memory concepts and our effort in developing a scalable massively parallel processor
(MPP) that is simple (very small number of unique part numbers) and has very high performance. Our
processor utilizes in its preferred embodiment a VLSI chip. The chip comprises 2n PME microcomputers,
where "n" represents the maximum number of array dimensionality. The chip further comprises a broadcast and
control interface (BCI) and internal and external communication paths between PMEs on the chip among
themselves and to the off chip system environment. The preferred chip has 8 PMEs (but we also can
provide more) and one BCI. The 2n PMEs and BCI are considered a node. This node can function in either
SIMD or MIMD mode, in dual SIMD/MIMD mode, with asynchronous processing, and with SIMIMD functionality.
Since it is scalable, this approach provides a node which can be the main building block for scalable
parallel processors of varying size. The microcomputer architecture of the PME provides fully distributed
message passing interconnection and control features within each node, or chip. Each node provides
multiple parallel microcomputer capability at the chip level, the microprocessor or personal computer level, at
a workstation level, at special application levels which may be represented by a vision and/or avionics level,
and, when fully extended, to capability at greater levels with powerful Gigaflop performance into the
supercomputer range. The simplicity is achieved by the use of a single highly extended DRAM chip that is
replicated into parallel clusters. This keeps the part number count down and allows scaling capability to the
cost or performance need, by varying the chip count, then the number of modules, etc.

Our approach enables us to provide a machine with attributes meeting the requirements that drive to a
parallel solution in a series of applications. Our methods of parallelization at the sub-chip level serve to keep
weight, volume, and recurring and logistic costs down.

Because our different size systems are all based upon a single chip, software tools are common for all
size systems. This offers the potential of development software (running on smaller workstation machines)
that is interchangeable among all levels (workstation, aerospace, and supercomputer). That advantage
means programmers can develop programs on workstations while a production program runs on a much
larger machine.

As a result of our well balanced design implementation we meet today's requirements imposed by
technology, performance, cost and perception, and enable growth of the system into the future. Since our
MPP approach starts at the chip level, our discussion starts at the chip technology description and
concludes with the supercomputer application descriptions.

Physical, interconnection, and architectural issues will all be addressed in the machine directly.
Functions will not only be integrated into a single chip design, but the chip design will provide functions
sufficiently powerful and flexible that the chip will be effective at processing, routing, storage and three
classes of I/O. The interconnection network will be a new version of the hypercube which provides minimum
network diameters without the input/output pin and wireability limitations normally associated with
hypercubes. The trade-offs between SIMD and MIMD are eliminated because the design allows processors to
dynamically switch between MIMD and SIMD mode. This eliminates many problems which will be
encountered by application programmers of "hybrid" machines. In addition, the design will allow a subset of
the processors to be in SIMD or MIMD mode.

The Advanced Parallel Array Processor (APAP) is a fine-grained parallel processor. It consists of control
and processing sections which are partitionable such that configurations suitable for supercomputing
through personal computing applications can be satisfied. In most configurations it would attach to a host
processor and support the off loading of segments of the host's workload. Because the APAP array
processing elements are general purpose computers, the particular type of workload off-loaded will vary
depending upon the capabilities of the host. For example, our APAP can be a module for an IBM 3090
vector processor mainframe. When attached to a mainframe with high performance vector floating point
capability the task off-loaded might be sparse to dense matrix transformations. Alternatively, when attached
to a PC personal computer the off-loaded task might be numerically intensive 3 dimensional graphics
processing.

The above referenced parent USSN 07/611,594, filed November 13, 1990, of Dieffenderfer et al., titled
"Parallel Associative Processor System", describes the idea of integrating computer memory and control
logic within a single chip and replicating the combination within the chip and building a processor system
out of replications of the single chip. This approach, which is continued and expanded here, leads to a
system which provides massively parallel processing capability at the cost of developing and manufacturing
only a single chip type while enhancing performance capability by reducing the chip boundary crossings
and line length.

The above referenced parent USSN 07/611,594, filed November 13, 1990, illustrated utilization of
1-dimensional I/O structures (essentially a linear I/O) with multiple SIMD PMEs attached to that structure within
a chip. This embodiment elaborates these concepts to dimensions greater than 1. The description which
follows will be in terms of 4-dimensional I/O structures with 8 SIMD/MIMD PMEs per chip. However, that
can be extended to greater dimensionality or more PMEs per dimension as we will describe with respect to
FIGURES 3, 9, 10, 15 and 16.

Our processing element includes a full I/O system including both data transfer and program interrupts.
Our description of our preferred embodiment will be primarily described in terms of the preferred
4-dimensional I/O structures with 8 SIMD/MIMD PMEs per chip, which has special advantages now in our
view. However, that can be extended to greater dimensionality or more PMEs per dimension as described
in our parent application. In addition, for most applications we prefer and have made inventions in areas of
greater dimensions with hypercube interconnections, preferably with the modified hypercube we describe.
However, in some applications a 2-dimensional mesh interconnection of chips will be applicable to a task at
hand. For instance, in certain military computers a 2-dimensional mesh will be suitable and cost effective.

This disclosure extends the concepts from the interprocessor communication to the external
Input/Output facilities and describes the interfaces and modules required for control of the processing array.
In summary, three types of I/O are described: inter-processor, processors to/from external, and broadcast/control.
Massively parallel processing systems require all these types of I/O bandwidth demands to be
balanced with processor computing capability. Within the array these requirements will be satisfied by
replicating a 16 bit (reduced) instruction set processor, augmented with very fast interrupt state swapping
capability. That processor is referred to as the PME, illustrating the preferred embodiment of our APAP. The
characteristics of the PME are completely unique when compared with the processing elements on other
massively parallel machines. It permits the processing, routing, storage and I/O to be completely distributed.
This is not characteristic of any other design.

In a hypercube each PME can address as its neighbor any PME whose address differs in any single bit
position. In a ring, any PME can address as its neighbor the two PMEs whose addresses differ by ±1. The
modified hypercube of our preferred embodiment utilized for the APAP combines these approaches by
building hypercubes out of rings. The intersection of rings is defined to be a node. Each node of our
preferred system has its PME, memory and I/O, and other features of the node, formed in a semiconductor
silicon low level CMOS DRAM chip. Nodes are constructed from multiple PMEs on each chip. Each PME
exists in only one ring of nodes. PMEs within the node are connected by additional rings such that
communications can be routed between rings within the node. This leads to the addressing structure where
any PME can step messages toward the objective by addressing a PME in its own ring or an adjacent PME
within the node. In essence a PME can address a PME whose address differs by 1 in the log2 d bit
field of its ring (where d is the number of PMEs in the ring) or the PME with the same address but existing
in an adjacent dimension. The PME effectively appears to exist in n sets of rings, while in actuality it exists
only in one real ring and one hidden ring totally contained within the chip. The dimensionality for the
modified hypercube is defined to be the value n from the previous sentence.
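
The addressing rules in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the addressing scheme of the preferred embodiment: the `PME` tuple, its `node`/`dim`/`side` fields, and the function names are hypothetical, and the sketch assumes a ring of d PMEs is formed from d/2 nodes that each contribute a pair of PMEs to that ring.

```python
from typing import NamedTuple


class PME(NamedTuple):
    node: tuple   # node coordinates, one entry per dimension (hypothetical encoding)
    dim: int      # the one ring (dimension) this PME belongs to
    side: int     # 0 or 1: which member of the PME pair along that ring


def hypercube_neighbors(addr: int, bits: int) -> list:
    """Addresses that differ from addr in exactly one bit position."""
    return [addr ^ (1 << b) for b in range(bits)]


def ring_neighbors(pos: int, d: int) -> list:
    """The two ring positions at +1 and -1 from pos, modulo the d PMEs in the ring."""
    return [(pos + 1) % d, (pos - 1) % d]


def modified_neighbors(p: PME, nodes_per_ring: int = 8) -> list:
    """Neighbors of a PME in the sketched modified hypercube."""
    n = len(p.node)                       # dimensionality
    d = 2 * nodes_per_ring                # PMEs per ring (assumed two per node)
    pos = 2 * p.node[p.dim] + p.side      # position of this PME along its own ring
    result = []
    # Two neighbors in the PME's own ring: positions differing by +1 and -1.
    for q in ring_neighbors(pos, d):
        node = list(p.node)
        node[p.dim] = q // 2
        result.append(PME(tuple(node), p.dim, q % 2))
    # Two neighbors with the same node address in the adjacent dimensions.
    for step in (1, -1):
        result.append(PME(p.node, (p.dim + step) % n, p.side))
    return result
```

Under these assumptions, `modified_neighbors(PME((0, 0, 0, 0), 0, 0))` returns four PMEs, three of which share the node coordinates of the starting PME and one of which lies in an adjacent node of the same ring.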

We prefer to use a modified hypercube. This is elaborated in the part of this application describing the
technology. Finally, PMEs within a ring are paired such that one moves data externally clockwise along a
ring of nodes and the other moves data externally counterclockwise along the ring of nodes, thus dedicating
a PME to an external port.

In our massively parallel machine, in our preferred embodiment the interconnection and broadcast of
data and instructions from one PME to another PME in the node and externally of the node to other nodes
of a cluster or PMEs of a massively parallel processing environment are performed by a programmable
router, allowing reconfiguration and virtual flexibility to the network operations. This important feature is fully
distributed and embedded in the PME and allows for processor communication and data transfers among
PMEs during operations of the system in SIMD and MIMD modes, as well as in the SIMD/MIMD and
SIMIMD modes of operation.

Within the rings each interconnection leg is a point-to-point connection. Each PME has a point-to-point
connection with the two neighboring PMEs in its ring and with two neighboring PMEs in two adjacent rings.
Three of these point-to-point connections are internal to the node, while the fourth point-to-point connection
is to an adjacent node.

The massively parallel processing system uses the processing elements, with their local memory and
interconnect topology, to connect all processors to each other. Embedded within the PME is our fully
distributed I/O programmable router. Our system also provides an addition to the system which provides the
ability to load and unload all the processing elements. With our zipper we provide a method for loading and
unloading of the array of PEs and thus enable implementation of a fast I/O along an edge of the array's
rings. To provide for external interface I/O any subset of the rings may be broken (un-zipped across some
dimension(s)) with the resultant broken paths connected to the external interface. The co-pending
application entitled "APAP I/O ZIPPER", filed concurrently herewith, USSN , filed May 22, 1992,
describes our "zipper" in additional detail. The "zipper" can be applied to only the subset of links required to
support the peak external I/O load, which in all configurations considered so far leads to its being applied
only to one or two edges of the physical design.

The final type of I/O consists of data that must be broadcast to, or gathered from, all PMEs, plus data
which is too specialized to fit on the standard buses. Broadcast data includes commands, programs and
data. Gathered data is primarily status and monitor functions while diagnostic and test functions are the
specialized elements. Each node, in addition to the included set of PMEs, contains one Broadcast and
Control Interface (BCI) section.

Consider PMEs interconnected in a modified 4 dimensional hypercube network. If each ring contains 16
PMEs, then the system will have 32,768 PMEs. The network diameter is 18 steps. Each PME contains in
S/W the router and reconfiguration S/W to support a particular outgoing port. Thus, software routing
provides the capability to reconfigure in the event of a faulty processing element or node. Inherent in a 4d,
25 MHz network design with byte wide half duplex rings is the provision for 410 gigabytes per second peak
internal bandwidth.

The 4 dimensional hypercube leads to a particularly advantageous package. Eight of the PMEs
(including data flow, memory and I/O paths and controls) are encompassed in a single chip. Thus, a node
will be a single chip including pairs of elements along the rings. The nodes are configured together in an 8
X 8 array to make up a cluster. The fully populated machine is built up of an array of 8 X 8 clusters to
provide the maximum capacity of 32,768 PMEs.
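
The system-size figures quoted above follow from simple multiplication, which the short Python sketch below reproduces. The constant and function names are ours. The 5 MIPS per PME figure is taken from elsewhere in this document, and the bandwidth estimate rests on an assumption of ours that is not stated in the text: that at any instant half of the PMEs are each driving one byte-wide link at 25 MHz, which happens to land near the quoted 410 gigabytes per second.

```python
PMES_PER_NODE = 8               # eight PMEs on a single chip (node)
NODES_PER_CLUSTER = 8 * 8       # nodes arranged in an 8 x 8 array
CLUSTERS_PER_SYSTEM = 8 * 8     # clusters arranged in an 8 x 8 array
MIPS_PER_PME = 5                # per-PME performance quoted in this document
LINK_BYTES_PER_SECOND = 25_000_000   # byte-wide link clocked at 25 MHz


def total_pmes() -> int:
    """PME count of the fully populated machine."""
    return PMES_PER_NODE * NODES_PER_CLUSTER * CLUSTERS_PER_SYSTEM


def total_mips() -> int:
    """Aggregate fixed point performance in MIPS."""
    return total_pmes() * MIPS_PER_PME


def peak_bandwidth_gb_per_s() -> float:
    """Assumption: half of the PMEs transmit at once on half duplex links."""
    return (total_pmes() // 2) * LINK_BYTES_PER_SECOND / 1e9


# Four dimensions with 8 nodes along each ring give the same node count
# as an 8 x 8 array of 8 x 8 clusters.
assert 8 ** 4 == NODES_PER_CLUSTER * CLUSTERS_PER_SYSTEM
```

With these constants the sketch gives 32,768 PMEs and 163,840 MIPS, and the assumed bandwidth model gives roughly 409.6 gigabytes per second.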

Each PME is a powerful microcomputer having significant memory and I/O functions. There is multibyte
data flow within a reduced instruction set (RISC) architecture. Each PME has 16 bit internal data flow and
eight levels of program interrupts with the use of working and general registers to manage data flow. There
is a circuit switched and store and forward mode for I/O transfer under PME software control. The SIMD
mode or MIMD mode is under PME software control. The PME can execute RISC instructions from either
the BCI in SIMD mode, or from its own main memory in MIMD mode. Specific RISC instruction code
points can be reinterpreted to perform unique functions in the SIMD mode. Each PME can implement an
extended Instruction Set Architecture and provide routines which perform macro level instructions such as
extended precision fixed point arithmetic, floating point arithmetic, vector arithmetic, and the like. This
permits not only complex math to be handled but image processing activities for display of image data in
multiple dimensions (2d and 3d images) and for multimedia applications. The system can select groups of
PMEs for a function. PMEs assigned can allocate selected data and instructions for group processing. The
operations can be externally monitored via the BCI. Each BCI has a primary control input, a secondary
control input and a status monitor output for the node. Within a node the 2n PMEs can be connected in a
binary hypercube communication network within the chip. Communication between PMEs is controlled by
the bits in PME control registers under control of PME software. This permits the system to have a virtual
routing capability. Each PME can step messages up or down its own ring or to its neighboring PME in
either of two adjacent rings. Each interface between PMEs is a point-to-point connection. The I/O ports
permit off-chip extensions of the internal ring to adjacent nodes of the system. The system is built up of
replications of a node to form a node array, a cluster, and other configurations.
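
The software-controlled choice of instruction source described above (the BCI in SIMD mode, the PME's own main memory in MIMD mode) can be pictured with the following Python sketch. It is a toy model under our own assumptions: the class `PMESketch`, the `Mode` enum and the string "instructions" are hypothetical and stand in for hardware behavior that the text describes only in outline.

```python
from enum import Enum


class Mode(Enum):
    SIMD = "simd"
    MIMD = "mimd"


class PMESketch:
    """Toy model of one PME choosing where its next instruction comes from."""

    def __init__(self, program):
        self.memory = list(program)   # local main memory holding the MIMD program
        self.pc = 0                   # program counter into local memory
        self.mode = Mode.SIMD         # mode is switched by PME software

    def next_instruction(self, bci_instruction):
        # SIMD: execute whatever the broadcast and control interface supplies.
        if self.mode is Mode.SIMD:
            return bci_instruction
        # MIMD: fetch from the PME's own memory and advance the program counter.
        instruction = self.memory[self.pc]
        self.pc += 1
        return instruction
```

In this model a PME created with a two-instruction local program returns the broadcast instruction while in SIMD mode, and after `pme.mode = Mode.MIMD` it returns its own instructions in order while ignoring the broadcast value.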

To complement our system's SIMD, MIMD, SIMD/MIMD and SIMIMD functionality, in our development we
have provided unique operational modes. Among our SIMD/MIMD PME's unique modes are the new
functional features referred to as the "store and forward / circuit switch" functions. These hardware functions,
complemented with the on chip communication and programmable internal and external I/O routing, provide
the PME with very optimal data transferring capability. In the preferred mode of operation the processor
memory is generally the data sink for messages and data targeted at the PME in the store and forward
mode. Messages and data not targeted for the PME are sent directly to the required output port when in
circuit switched mode. The PME software performs the selected routing path while giving the PME a
dynamically selectable store and forward / circuit switch functionality.
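
The distinction between the two transfer modes can be sketched as below. This is a minimal illustration, not the PME hardware logic: the class `RouterSketch`, its fields, and in particular the `select_port` rule are hypothetical placeholders for the routing path that the text says PME software selects.

```python
from dataclasses import dataclass, field


@dataclass
class RouterSketch:
    """Toy model of store and forward / circuit switch handling in one PME."""
    addr: int
    circuit_switched: bool = True                 # dynamically selectable mode
    memory: list = field(default_factory=list)    # processor memory as data sink
    ports: dict = field(default_factory=dict)     # output port name -> queued traffic

    def select_port(self, dest: int) -> str:
        # Hypothetical routing rule standing in for the software-selected path.
        return "cw" if dest > self.addr else "ccw"

    def receive(self, dest: int, payload) -> str:
        # Circuit switched: traffic for another PME goes straight to an output port.
        if self.circuit_switched and dest != self.addr:
            port = self.select_port(dest)
            self.ports.setdefault(port, []).append((dest, payload))
            return port
        # Store and forward (or traffic addressed to this PME): memory is the sink.
        self.memory.append((dest, payload))
        return "memory"
```

In this model a message addressed to the PME itself always lands in memory, while a message for another PME bypasses memory only when circuit switched mode is selected.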

Among the advances we have provided is a fully distributed architecture for PMEs of a node. Each
node has 2n processors, memory and I/O. Every PME will provide very flexible processing capability with
16 bit data flow, 64K bytes of local storage, store and forward/circuit switch logic, PME to PME
communication, SIMD/MIMD switching capabilities, programmable routing, and dedicated floating point
assist logic. The organization of every PME and its communication paths with other PMEs is within the same
chip to minimize chip crossing delays. PME functions can be independently operated by the PME and
integrated with functions in the node, a cluster, and larger arrays.

Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, clusters of
nodes, and arrays of PMEs already packaged in clusters. For control of these packaged systems we
provide a system array director which with the hardware controllers performs the overall Processing
Memory Element (PME) Array Controller functions in the massively parallel processing environment. The
Director comprises three functional areas: the Application Interface, the Cluster Synchronizer, and
normally a Cluster Controller. The Array Director will have the overall control of the PME array, using the
broadcast bus and our zipper connection to steer data and commands to all of the PMEs. The Array
Director functions as a software system interacting with the hardware to perform the role as the shell of the
APAP operating system.

The interconnection for our PMEs in a massively parallel array computer, the SIMD/MIMD processing
memory element (PME) interconnection, provides the processor to processor connection in the massively
parallel processing environment. Each PME utilizes our fully distributed interprocessor communication
hardware from the on-chip PME to PME connection, to the off-chip I/O facilities which support the
chip-to-chip interconnection. Our modified topology limits our cluster to cluster wiring while supporting the
advantages of hypercube connections.

The concepts which we employ for a PME node are related to the VLSI packaging techniques used for
the Advanced Parallel Array Processor (APAP) computer system disclosed here, which packaging features
of our invention provide enhancements to the manufacturing ability of the APAP system. These techniques
are unique in the area of massively parallel processor machines and will enable the machine to be
packaged and configured in optimal subsets that can be built and tested.

The packaging techniques take advantage of the eight PMEs packaged in a single chip and arranged in
an N-dimensional modified hypercube configuration. This chip level package or node of the array is the
smallest building block in the APAP design. These nodes are then packaged in an 8 x 8 array where the
+-X and the +-Y make rings within the array or cluster and the +-W and +-Z are brought out to the
neighboring clusters. A grouping of clusters makes up an array.

The intended applications for APAP
computers depend upon the particular configuration and host. Large systems attached to mainframes with
effective vectorized floating point processors might address special vectorizable problems, such as weather
prediction, wind tunnel simulation, turbulent fluid modeling and finite element modeling. Where these
problems involve sparse matrices, significant work must be done to prepare the data for vectorized
arithmetic and likewise to store results. That workload would be off loaded to the APAP. In intermediate size
systems, the APAP might be dedicated to performing the graphics operations associated with visualization,
or with some preprocessing operation on incoming data (i.e., performing optimum assignment problems in
military sensor fusion applications). Small systems attached to workstations or PCs might serve as
programmer development stations or might emulate a vectorized floating point processor attachment or a
3d graphics processor.



BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a parallel processor processing element like those which would utilize old technology.

FIG. 2 shows a massively parallel processor building block in accordance with our invention,
representing our new chip design.

FIG. 3 illustrates on the right side the preferred chip physical cluster layout for our preferred
embodiment of a chip single node fine grained parallel processor. There each chip is a
scalable parallel processor chip providing 5 MIPS performance with CMOS DRAM memory
and logic permitting air cooled implementation of massive concurrent systems. On the left
side of FIGURE 3, there is illustrated the replaced technology.

FIG. 4 shows a computer processor functional block diagram in accordance with the invention.

FIG. 5 shows a typical Advanced Parallel Array Processor computer system configuration.

FIG. 6 shows a system overview of our fine-grained parallel processor technology in accordance
with our invention, illustrating system build up using replication of the PME element which
permits systems to be developed with 40 to 163,840 MIPS performance.

FIG. 7 illustrates the hardware for the processing element (PME) data flow and local memory in
accordance with our invention, while

FIG. 8 illustrates PME data flow where a processor memory element is configured as a hardwired
general purpose computer that provides about 5 MIPS fixed point processing or 0.4 MFLOPS via
programmed control floating point operations.

FIG. 9 shows the PME to PME connection (binary hypercube) and data paths that can be taken in
accordance with our invention, while

FIG. 10 illustrates node interconnections for the chip or node which has 8 PMEs, each of which
manages a single external port and permits distribution of the network control function and
eliminates a functional hardware port bottleneck.

FIG. 11 is a block diagram of a scalable parallel processor chip where each PME is a 16 bit wide
processor with 32K words of local memory and there is I/O porting for a broadcast port
which provides a controller-to-all interface while external ports are bidirectional point-to-point
interfaces permitting ring torus connections within the chip and externally.

FIG. 12 shows an array director in the preferred embodiment.

FIG. 13 in part (a) illustrates the system bus to or from a cluster array coupling enabling loading or
unloading of the array by connecting the edges of clusters to the system bus (see FIGURE
14). In FIGURE 13 in part (b) there is the bus to/from the processing element portion.
FIGURE 13 illustrates how multiple system buses can be supported with multiple clusters.
Each cluster can support 50 to 57 Mbyte/s bandwidth.

FIG. 14 shows a "zipper" connection for fast I/O connection.

FIG. 15 shows an 8 degree hypercube connection illustrating a packaging technique in accordance
with our invention applicable to an 8 degree hypercube.

FIG. 16 shows two independent node connections in the hypercube.

FIG. 17 shows the Bitonic Sort algorithm as an example to illustrate the advantages of the defined
SIMD/MIMD processor system.

FIG. 18 illustrates a system block diagram for a host attached large system with one application
processor interface illustrated. This illustration may also be viewed with the understanding
that our invention may be employed in stand alone systems which use multiple application
processor interfaces. Such interfaces in a FIGURE 18 configuration will support
DASD/graphics on all or many clusters. Workstation accelerators can eliminate the host,
application processor interface (API) and cluster synchronizer (CS) illustrated by emulation.
The CS is not required in all instances.

FIG. 19 illustrates the software development environment for our system. Programs can be prepared
by and executed from the host application processor. Both program and machine debug is
supported by the workstation based console illustrated here and in FIGURE 22. Both of these
services will support applications operating on a real or a simulated MMP, enabling
applications to be developed at a workstation level as well as on a supercomputer formed of the
APAP MMP. The common software environment enhances programmability and distributed
usage.

FIG. 20 illustrates the programming levels which are permitted by the new systems. As different
users require more or less detailed knowledge, the software system is developed to support
this variation. At the highest level the user does not need to know the architecture is indeed
an MMP. The system can be used with existing language systems for partitioning of
programs, such as parallel Fortran.

FIG. 21 illustrates the parallel Fortran compiler system for the MMP provided by the APAP
configurations described. A sequential to parallel compiler system uses a combination of existing
compiler capability with new data allocation functions and enables use of a partitioning
program like FortranD.

FIG. 22 illustrates the workstation application of the APAP, where the APAP becomes a workstation
accelerator. Note that the unit has the same physical size as a RISC/6000 Model 530, but
this model now contains an MMP which is attached to the workstation via a bus extension
module illustrated.

FIG. 23 illustrates an application for an APAP MMP module for an AWACS military or commercial
application. This is a way of handling efficiently the classical distributed sensor fusion
problem shown in FIGURE 23, where the observation to track matching is classically done
with well known algorithms like nearest neighbor, 2 dimensional linear assignment (Munkres),
probabilistic data association or multiple hypothesis testing, but these can now be done in an
improved manner as illustrated by FIGURES 24 and 25.

FIG. 24 illustrates how the system provides the ability to handle n-dimensional assignment problems
in real time.

FIG. 25 illustrates processing flow for an n-dimensional assignment problem utilizing an APAP.

FIG. 26 illustrates the expansion unit provided by the system enclosure described, showing how a
unit can provide 424 MFLOPS or 5120 MIPS using only 8 to 10 extended SEM-E modules,
providing the performance comparable to that of a specialized signal processor module in only
.6 cubic feet. This system can become a SIMD massive machine with 1024 parallel
processors performing two billion operations per second (GOPS) and can grow by adding
1024 additional processors and 32MB additional storage.

FIG. 27 illustrates the APAP packaging for a supercomputer. Here is a large system of comparable
performance but much smaller footprint than other systems. It can be built by replicating the
APAP cluster within an enclosure like those used for smaller machines.

We have provided, as part of the description, Tables illustrating the hardwired instructions for a PME, in
which Table 1 illustrates fixed-point arithmetic instructions; Table 2 illustrates storage to storage
instructions; Table 3 illustrates logical instructions; Table 4 illustrates shift instructions; Table 5 illustrates branch
instructions; Table 6 illustrates the status switching instructions; and Table 7 illustrates the input/output
instructions.

(Note: For convenience of illustration in the formal patent drawings, FIGURES may be separated in parts
and as a convention we place the top of the FIGURE as the first sheet with subsequent sheets proceeding
down and across when viewing the FIGURE, in the event that multiple sheets are used.)

Our detailed description follows with parts explaining the preferred embodiments of our invention
provided by way of example.


DETAILED DESCRIPTION OF THE INVENTION 

Turning now to our invention in greater detail, FIGURE 1 illustrates the
existing technology level, typified by the transputer T800 chip and representing similar chips for such
machines as the Touchstone Delta (i860), N Cube ('386), and others. When FIGURE 1 is
compared with the developments here, it will be seen that not only can systems like the prior systems be
substantially improved by employing our invention, but also new powerful systems can be created, as we
will describe. FIGURE 1's conventional modern microprocessor technology consumes pins and memory.
Bandwidth is limited and inter-chip communication drags the system down.

The new technology leapfrog represented by FIGURE 2 merges processors, memory and I/O into multiple
PMEs (eight or more 16 bit processors each of which has no memory access delays and uses all the pins
for networking) formed on a single low power CMOS DRAM chip. The system can make use of ideas of our
prior referenced disclosures as well as inventions separately described in the applications filed concurrently
herewith and applicable to the system we describe here. Thus, for this purpose they are incorporated herein
by reference. Our concepts of grouping, autonomy, transparency, zipper interaction, asynchronous SIMD,
SIMIMD or SIMD/MIMD, can all be employed with the new technology, even though to lesser advantage
they can be employed in the systems of the prior technology and in combination with our own prior multiple
picket processor.

Our picket system can employ the present processor. Our basic concept is that we have now provided
a replicable brick, a new basic building block for systems with our new memory processor, a memory unit
having embedded processors, router and I/O. This basic building block is scalable. The basic system which
we have implemented employs a 4 Meg. CMOS DRAM. It is expandable to be used in larger memory
configurations, with 16Mbit DRAMs, and 64Mbit chips by expansion. Each processor is a gate array. With
denser deposition, many more processors, at higher clock speeds, can be placed on the same chip, and
using gates and additional memory will expand the performance of each PME. Scaling a single part type
provides a system framework and architecture which can have a performance well into the PETAOP range.

FIGURE 2 illustrates the memory processor which we call the PME or processor memory element in
accordance with our preferred embodiment. The chip has eight or more processors. In the pictured
embodiment there are eight. The chip can be expanded (horizontally) to add more processors. The chip
can, as preferred, retain the logic and expand the DRAM memory with additional cells linearly (vertically).
Pictured are 16 sections of 32K by 9 bit DRAM memory surrounding a field of CMOS gate array gates
which implement 8 replications of a 16 bit wide data flow processor.

The chip is made using IBM low power sub-micron CMOS deposition on silicon technology. It uses selected
silicon with trench to provide significant storage on a small chip surface. Our organized interconnection of
memory and multiple processors is made with IBM's advanced art of making semiconductor chips.
However, it will be recognized that while the little chip we describe has about 4 Meg. memory, it is designed
so that as 16 Meg. memory technology becomes stable, when improved yields and methods of accommodating
defects are certain, our little chip can migrate to larger memory sizes, each 9 bits wide, without changing
the logic. Advances in photo and X-ray lithography keep pushing minimum feature size to well below 0.5
microns. Our design envisions more progress. These advances will permit placement of very large amounts
of memory with processing on a single silicon chip.

Our device is a 4 MEG CMOS DRAM believed to be the first general memory chip with extensive room
for logic. 16 replications of a 32K by 9-bit DRAM macro make up the memory array. The chip allocates
significant surface area for 120K cells of application logic, with triple level metal wiring.
The processor logic cells are preferably gate array cells. The 35 ns or less DRAM access time matches the
processor cycle time. This CMOS implementation provides logic density for a very effective PE (picket) and
does so while dissipating 1.3 watts for the logic. The separate memory sections of the chip, each 32K by 9
bits (with expansion not changing logic), surround the field of CMOS gate array gates representing 120K
cells, and having the logic described in other figures. Memory is barriered and with a separated power
source dissipates .9 watts. In combining significant amounts of logic on the same silicon
substrate with significant amounts of memory, problems involved with the electrical noise incompatibility of
logic and DRAM have been overcome. Logic tends to be very noisy while memory needs relative quiet to
sense the millivolt size signals that result from reading the cells of DRAM. We prefer to provide trenched
triple metal layer silicon deposition, with separate barriered portions of the memory chip devoted to memory
and to processor logic with voltage and ground isolation, and separate power distribution and barriers, to
achieve compatibility between logic and DRAM.
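
The memory figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is
illustrative only: the constant and function names are ours, and the split of each 9-bit macro word into
8 data bits plus 1 parity bit is an assumption (the text states the 9-bit width and a 16 bit data word, but
not how the ninth bit is used).

```python
# Illustrative constants; names are hypothetical, values come from the description.
MACROS_PER_CHIP = 16          # replications of the DRAM macro
WORDS_PER_MACRO = 32 * 1024   # 32K words per macro
BITS_PER_MACRO_WORD = 9       # assumed: 8 data bits + 1 parity bit
PMES_PER_CHIP = 8             # processors per chip in the pictured embodiment


def chip_bits():
    """Total DRAM cells on the chip, parity included."""
    return MACROS_PER_CHIP * WORDS_PER_MACRO * BITS_PER_MACRO_WORD


def macros_per_pme():
    """Macros available to each processor when the array is split evenly."""
    return MACROS_PER_CHIP // PMES_PER_CHIP


def pme_data_bytes():
    """Data bytes per PME: two 9-bit macros side by side give 16 data bits per word."""
    data_bits_per_word = macros_per_pme() * (BITS_PER_MACRO_WORD - 1)
    return WORDS_PER_MACRO * data_bits_per_word // 8
```

Under these assumptions the sixteen macros hold 4.5 Mbit in total (4 Mbit of it data), each processor is
paired with two macros, and each processor sees a 64 KByte store.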

APAP System Overview of Preferred Embodiments

This description introduces the new technology in the following order: 

1. Technology
2. Chip H/W description
3. Networking and system build up
4. Software
5. Applications

The initial sections of the detailed description describe how 4-Meg DRAM low power CMOS chips are made
to include 8 processors on and as part of the manufactured PME DRAM chips, each supporting:
1. 16 bit, 5 MIP dataflows,
2. independent instruction stream and interrupt processing, and
3. 8 bit (plus parity and controls) wide external port and interconnection to 3 other on chip processors.

Our invention provides multiple functions which are integrated into a single chip design. The chip will
provide PME functions which are powerful and flexible, sufficiently so that a chip having scalability
will be effective at processing, routing, storage and three classes of I/O. This chip has integrated memory
and control logic within the single chip to make the PME, and this combination is replicated within the chip.
A processor system is built from replications of the single chip.

The approach partitions the low power CMOS DRAM. It will be formed as multiple word length (16) bit
by 32K sections, associating one section with a processor. (We use the term PME to refer to a single
processor, memory and I/O capable system unit.) This partitioning leads to each DRAM chip being an 8
way 'cube connected' MIMD parallel processor with 8 byte wide independent interconnection ports. (See
FIGURE 6 for an illustration of a replication of fine-grained parallel technology, illustrating replication and the
ring torus possibilities.)

The software description addresses several distinct program types. At the lowest level, processes
interface the user's program (or services called by the application) to the detailed hardware (H/W) needs.
This level includes the tasks required to manage the I/O and interprocessor synchronization and is what
might be called a microprogram for the MPP. An intermediate level of services provides for both mapping
applications (developed with vector or matrix operations) to the MPP, and also control, synchronization,
startup and diagnostic functions. At the host level, high order languages are supported by library functions that
support vectorized programs with either simple automatic data allocation to the MPP or user tuned data
allocation. The multi-level software (S/W) approach permits applications to exploit different degrees of control
and optimization within a single program. Thus, a user can code application programs without understanding
the architecture detail while an optimizer might tune at the microcode level only the small high usage
kernels of a program.

Sections of our description that describe 1024 element 5 GIPS units and a 32,768 element 164 GIPS
unit illustrate the range of possible systems. However, those are not the limits; both smaller and larger units
are feasible. These particular sizes have been selected as examples because the small unit is suitable to
microprocessors (accelerators), personal computers, workstation and military applications (using of course
different packaging techniques), while the larger unit is illustrative of a mainframe application as a module or
complete supercomputer system. A software description will provide examples of other challenging work
that might be effectively programmed on each of the illustrative systems.

PME DRAM CMOS - A BASE FOR A MULTIPROCESSOR PME

FIGURE 2 illustrates our technology improvement at the chip technology level. This expandable
computer organization is very cost and performance efficient over the wide range of system sizes because
it uses only one chip type. Combining the memory and processing on one chip eliminates the pins
dedicated to the memory bus and their associated reliability and performance penalties. Replication of our
design within the chip makes it economically feasible to consider custom logic designs for processor
subsections. Replication of the chip within the system leads to large scale manufacturing economies.
Finally, CMOS technology requires low power per MIP, which in turn minimizes power supply and cooling
needs. The chip architecture can be programmed for multiple word lengths, enabling operations to be
performed that would otherwise require much larger length processors. In combination these attributes
permit the extensive range of system performance.
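
As a rough illustration of how a 16 bit data flow can be programmed for longer word lengths, the sketch
below adds multi-word integers one 16 bit word at a time, carrying between words. It is a generic
multi-precision add, not the instruction sequence of the PME; the function name and the little-endian word
order are assumptions made for the example.

```python
MASK16 = 0xFFFF  # width of one native word in this sketch


def add_multiword(a_words, b_words):
    """Add two equal-length integers stored as little-endian lists of 16 bit words.

    Returns (sum_words, carry_out). Each loop step models one 16 bit add with carry-in.
    """
    if len(a_words) != len(b_words):
        raise ValueError("operands must have the same number of words")
    carry = 0
    out = []
    for a, b in zip(a_words, b_words):
        s = a + b + carry
        out.append(s & MASK16)   # low 16 bits stay in this word
        carry = s >> 16          # bit 16 propagates to the next word
    return out, carry
```

For example, adding 0x0001FFFF and 0x00000001 as two-word operands takes two 16 bit steps and yields
0x00020000.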

Our new technology can be compared with a possible extension of the old technology it overlaps. It is
apparent that the advantages of smaller features have been used by processor designers to construct more
complex chips and by memory designers to provide greater replication of the simple element. If the trend
continues one could expect memories to get four times as large while processors might exploit density to:
1. include multiple execute units with instruction routers,
2. increase cache sizes and associative capability, and/or
3. increase instruction look ahead and advance computation capability.

However, these approaches to the old technology illustrated by FIGURE 1 all tend to dead end.
Duplicating processors leads to linearly increasing pin requirements, but pins per chip is fixed. Better
caching can only exploit the application's data reuse pattern. Beyond that, memory bandwidth becomes the limit.
Application data dependencies and branching limit the potential advantage of look ahead schemes.
Additionally, it is not apparent that MPP applications with fine-grained parallelism need 1, 4, or 16
Megaword memories per processing unit. Attempting to share such large memories between multiple
processors results in severe memory bandwidth limitations.

Our new approach is not dead ended. We combine significant memory, I/O and processor into
a single chip, as illustrated by FIGURE 2 and subsequent illustration and description. It reduces part
number requirements and eliminates the delays associated with chip crossing. More importantly, this
permits all the chip's I/O pins to be dedicated to interprocessor communication and thus maximizes
network bandwidth.

To implement our preferred embodiment illustrated in FIGURE 2 we use a process that is available now,
using IBM low power CMOS technology. Our illustrated embodiment can be made with CMOS DRAM
density, in CMOS, and can be implemented in denser CMOS. Our illustrated embodiment of 32K memory
cells for each of 8 PMEs on a chip can be increased as CMOS becomes denser. In our embodiment we
utilize the real estate and process technology for a 4 MEG CMOS DRAM, and expand this with processor
replication associated with 32K memory on the chip itself. The chip, it will be seen, has processor, memory,
and I/O in each of the chip packages of the cluster shown in FIGURE 3. Within each package is a memory
with embedded processor element, router, and I/O, all contained in a 4 MEG CMOS DRAM believed to be
the first general memory chip with extensive room for logic. It uses selected silicon with trench to provide
significant storage on a small chip surface. Each processor chip of our design alternatively can be made
with 16 replications of a 32K by 9 bit DRAM macro (35/30 ns) using 0.87 micron CMOS logic to make up the
memory array. The device is unique in that it allocates surface area for 120K cells of application logic on
the chip, supported by the capability of triple level metal wiring. The multiple cards of the old technology are
shown crossed out on the left side of FIGURE 3.

Our basic replicable element brick technology is an answer to the old technology. If one considered the
"Xed" technology on the left of FIGURE 3, one would see too many chips, too many cards, and waste. For
example, today's proposed teraflop machines that others offer would have literally a million or more chips in
them. With today's other technology only a few percent of these chips, at best, are truly operations
producers. The rest are "overhead" (typically memory, network interface, etc.).

It will become evident that it is not feasible to package such chips, in such a large number, in anything
that must operate in a constrained environment of physical size. (How many could you fit in a small area of
a cockpit?) Furthermore, such proposed teraflop machines of others, already huge, must scale up 1000x
to reach the petaop range. We have a solution which dramatically decreases the percent of
non-operations producing chips. We provide increased bandwidth. We provide this within a reasonable network
dimensionality. With such a brick technology, memory becomes the operator, networks are used
for passing controls, and the share of operations producing chips is dramatically increased. In addition, the upgrade
dramatically reduces the number of different types of chips. Our system is designed for scale-up, without a
requirement for specialized packaging, cooling, power, or environmental constraints.

With our brick technology, utilizing, instead of separate processors, memory units with built in
processors and network capability, the configuration shown in FIGURE 3 represents a card with chips
which are pin compatible with current 4Mbit DRAM cards at the connector level. Such a single card could
hold, with a design point of a basic 40 mip per chip performance level, 32 chips, or 1280 mips. Four such
cards would provide 5 gips. The workstation configuration which is illustrated would preferably have such a
PE memory array, a cluster controller, and an IBM RISC System/6000 which has sufficient performance to
run and monitor execution of an array processor application developed at the workstation.
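
The throughput figures above follow from simple multiplication, sketched here as a check. The names are
hypothetical; the 5 MIPS per processor rate is the 40 mip per chip design point divided over eight
processors, and treating the largest system as 128 such cards is only an arithmetic extrapolation, not a
packaging claim from the text.

```python
MIPS_PER_PME = 5      # 40 mip per chip spread over 8 processors
PMES_PER_CHIP = 8
CHIPS_PER_CARD = 32   # design point for one card


def chip_mips():
    return MIPS_PER_PME * PMES_PER_CHIP


def card_mips():
    return chip_mips() * CHIPS_PER_CARD


def pme_count(cards):
    """Number of processors on the given number of cards."""
    return cards * CHIPS_PER_CARD * PMES_PER_CHIP


def system_gips(cards):
    """Aggregate rate in GIPS (thousands of MIPS) for the given number of cards."""
    return cards * card_mips() / 1000
```

Four cards give 1024 processors at about 5 GIPS, and 32,768 processors at the same per-processor rate give
about 164 GIPS, matching the two illustrative system sizes.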

A very gate efficient processor can be used in the processor portion. Such designs for processors have
been employed, but never within memory. In addition, we have provided the ability to mix MIMD
and SIMD basic operation provisions. Our chip provides a "broadcast bus" which provides an alternate path
into each CPU's instruction buffer. Our cluster controller issues commands to each of the PEs in the PMEs,
and these can be stored in the PME to control their operation in one mode or another. Each PME does not
have to store an entire program, but can store only those portions applicable to a given task at various
times during processing of an application.

Given the basic device one can elect to develop a single processor memory combination. Alternatively,
by using a more simple processor and a subset of the memory macros one can design for either 2, 4, 8 or
16 replications of the basic processing element (PME). The PME can be made simpler either by adjusting
the dataflow bandwidth or by substituting processor cycles for functional accelerators. For most
embodiments we prefer to make 8 replications of the basic processing element we describe.

Our application studies have indicated that for now the most favorable answer is 8 replications of a 16
bit wide data flow and 32K word memory. We conclude this because:
1. 16 bit words permit single cycle fetch of instructions and addresses.
2. 8 PMEs each with an external port permits 4 dimensional torus interconnections. Using 4 or 8 PMEs
on each ring leads to modules suitable for the range of targeted system performances.
3. 8 external ports requires about 50% of the chip pins, providing sufficient remainder for power, ground
and common control signals.
4. 8 processors, each implemented with a 64 KByte Main Store,
a. allows for a register based architecture rather than a memory mapped architecture, and
b. forces some desirable but not required accelerators to be implemented by multiple processor
cycles.

This last attribute is important because it permits use of the developing logic density increase. Our new
accelerators (e.g. a floating point arithmetic unit per PME) are added as chip hardware without affecting
system design, pins and cables or application code.
The resultant chip layout and size (14.59 x 14.33 mm) is shown in FIGURE 2, and FIGURE 3 shows a
cluster of such chips, which can be packaged in systems like those shown in later FIGURES for stand alone
units, workstations which slide next to a workstation host with a connection bus, in AWACs applications, and
in supercomputers. This chip technology provides a number of system level advantages. It permits
development of the scalable MPP by basic replication of a single part type. The two DRAM macros per
processor provide sufficient storage for both data and program. An SRAM of equivalent size might consume
more than 10 times more power. This advantage permits MIMD machine models rather than the more
limited SIMD models characteristic of machines with single chip processor/memory designs. The 35 ns or
less DRAM access time matches the expected processor cycle time. CMOS logic provides the logic density
for a very effective PME and does so while dissipating only 1.3 watts. (Total chip power is 1.3 + .9
(memory) = 2.2 W.) These features in turn permit using the chip in MIL applications requiring conduction
cooling. (Air cooling in non-MIL applications is significantly easier.) However, the air cooled embodiment can
be used for workstation and other environments. A standalone processor might be configured with an 80
amp - 5 volt power supply.

Advanced Parallel Array Processor (APAP) building blocks are shown in FIGURE 4 and in FIGURE 5.
FIGURE 4 illustrates the functional block diagram of the Advanced Parallel Array Processor. Multiple
application interfaces 150, 160, 170, 180 exist for the application processor 100 or processors 110, 120,
130. FIGURE 5 illustrates the basic building blocks that can be configured into different system block
diagrams. The APAP, in a maximum configuration, can incorporate 32,768 identical PMEs. The processor
consists of the PME Array 280, 290, 300, 310, an Array Director 250 and an Application Processor Interface
260 for the application processor 200 or processors 210, 220, 230. The Array Director 250 consists of three
functional units: Application Processor Interface 260, Cluster Synchronizer 270 and Cluster Controller 270.
An Array Director can perform the functions of the array controller of our prior linear picket system for SIMD
operations with MIMD capability. The cluster controller 270, along with a set of 64 Array clusters 280, 290,
300, 310 (i.e. a cluster of 512 PMEs), is the basic building block of the APAP computer system. The
elements of the Array Director 250 permit configuring systems with a wide range of cluster replications. This
modularity based upon strict replication of both processing and control elements is unique to this massively
parallel computer system. In addition, the Application Processor Interface 260 supports the Test/Debug
device 240 which will accomplish important design, debug, and monitoring functions.

Controllers are assembled with a well-defined interface, e.g. IBM's MicroChannel, used in other systems
today, including controllers with i860 processors. Field programmable gate arrays add functions to the
controller which can be changed to meet a particular configuration's requirements (how many PMEs there
are, their couplings, etc.).

The PME arrays 280, 290, 300, 310 contain the functions needed to operate as either SIMD or MIMD
devices. They also contain functions that permit the complete set of PMEs to be divided into 1 to 256
distinct subsets. When divided into subsets the Array Director 250 interleaves between subsets. The
sequence of the interleave process and the amount of control exercised over each subset is program
controlled. This capability to operate distinct subsets of the array in one mode, i.e., MIMD with differing
programs, while other sets operate in tightly synchronized SIMD mode under Array Director control,
represents an advance in the art. Several examples presented later illustrate the advantages of the concept.

Array Architecture 

The set of nodes forming the Array is connected as an n-dimensional modified hypercube. In that
interconnection scheme, each node has direct connections to 2n other nodes. Those connections can be
either simplex, half-duplex or full-duplex type paths. In any dimension greater than 3d, the modified
hypercube is a new concept in interconnection techniques. (The modified hypercube in the 2d case
generates a torus, and in the 3d case an orthogonally connected lattice with edge surfaces wrapped to
opposing surface.)

To describe the interconnection scheme for greater than 3d cases requires an inductive description. A
set of m1 nodes can be interconnected as a ring. (The ring could be 'simply connected', 'braided', 'cross
connected', 'fully connected', etc. Although additional node ports are needed for greater than simple rings,
that added complexity does not affect the modified hypercube structure.) The m2 rings can then be linked
together by connecting each equivalent node in the m2 set of rings. The result at this point is a torus. To
construct an (i+1)d modified hypercube from an id modified hypercube, consider m(i+1) sets of id modified
hypercubes and interconnect all of the equivalent m(i) level nodes into rings.

This process is illustrated for the 4d modified hypercube, using m(i) = 8 for i = 1 to 4, by the illustration in
FIGURE 6. Compare our description under Node Topology and also FIGURES 6, 9, 10, 15 and 16.
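
The inductive construction can be made concrete with a small sketch that lists a node's direct neighbors.
It assumes 'simply connected' rings, represents a node as a tuple of ring positions, and uses a function
name of our own choosing; it is an illustration of the topology, not of the routing hardware.

```python
def neighbors(node, sizes):
    """Return the direct neighbors of `node` in a modified hypercube.

    node  -- tuple of coordinates, one per dimension
    sizes -- tuple of ring lengths m1..mn, one per dimension
    Each dimension is a simple ring, so a node has one neighbor in each
    direction per dimension: 2n neighbors in total when every ring has
    at least 3 nodes (with a ring of 2 the two directions coincide).
    """
    result = []
    for dim, m in enumerate(sizes):
        for step in (+1, -1):
            coord = list(node)
            coord[dim] = (coord[dim] + step) % m  # wrap around the ring
            result.append(tuple(coord))
    return result
```

With four rings of length 8, `neighbors((0, 0, 0, 0), (8, 8, 8, 8))` returns eight distinct nodes, and the
array as a whole has 8 x 8 x 8 x 8 = 4096 nodes.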

FIGURE 6 illustrates the fine-grained parallel technology path from the single processor element 300,
made up of 32K 16-bit words with a 16-bit processor, to the Network node 310 of eight processors 312 and
their associated memory 311 with their fully distributed I/O routers 313 and Signal I/O ports 314, 315, on
through groups of nodes labeled clusters 320 and into the cluster configuration 360 and to the various
applications 330, 340, 350, 370. The 2d level structure is the cluster 320, and 64 clusters are integrated to
form the 4d modified hypercube of 32,768 Processing Elements 360.

Processing Array Element (PME) Preferred Embodiment 

As illustrated by FIGURE 2 and FIGURE 11 the preferred APAP has a basic building block of a one
chip node. Each node contains 8 identical processor memory elements (PMEs) and one broadcast and
control interface (BCI). While some of our inventions may be implemented when all functions are not on the
same chip, it is important from a performance and cost reduction standpoint to provide the chip as a one
chip node with the 8 processor memory elements using the advanced technology which we have described
and can be implemented today.

The preferred implementation of a PME has a 64 KByte main store, 16 16-bit general registers on each
of 8 program interrupt levels, a full function arithmetic/logic unit (ALU) with working registers, a status
register, and four programmable bi-directional I/O ports. In addition the preferred implementation provides a
SIMD mode broadcast interface via the broadcast and control interface (BCI) which allows an external
controller (see our original parent application and the description of our currently preferred embodiment for
a nodal array and system with clusters) to drive PME operation decode, memory address, and ALU data
inputs. This chip can perform the functions of a microcomputer allowing multiple parallel operations to be
performed within it, and it can be coupled to other chips within a system of multiple nodes, whether by an
interconnection network, a mesh or hypercube network, or our preferred and advanced scalable
embodiment.

The PMEs are interconnected in a series of rings or tori in our preferred scalable embodiment. In some
applications the nodes could be interconnected in a mesh. In our preferred embodiment each node contains
two PMEs in each of four tori. The tori are denoted W, X, Y, and Z (see FIGURE 6). FIGURE 11 depicts the
interconnection of PMEs within a node. The two PMEs in each torus are designated by their external I/O
port (+W, -W, +X, -X, +Y, -Y, +Z, -Z). Within the node, there are also two rings which interconnect the 4
+n and 4 -n PMEs. These internal rings provide the path for messages to move between the external tori.
Since the APAP can be in our preferred embodiment a four dimensional orthogonal array, the internal
rings allow messages to move throughout the array in all dimensions.
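
One way to picture the connections inside a node is sketched below. It is a guess at a consistent wiring,
not a transcription of FIGURE 11: the ring order W, X, Y, Z and the direct link between the +n and -n PME of
the same torus are assumptions, chosen because they give each PME exactly three on-chip links, which agrees
with the "interconnection to 3 other on chip processors" stated earlier.

```python
DIMS = "WXYZ"  # assumed order of PMEs around each internal ring
PMES = [sign + dim for sign in "+-" for dim in DIMS]  # '+W', '+X', ..., '-Z'


def internal_links(pme):
    """Return the three on-chip PMEs assumed to be wired to `pme`.

    Two links are the neighbors on the internal ring of the same sign;
    the third is the partner PME that serves the same torus.
    """
    sign, dim = pme[0], pme[1]
    i = DIMS.index(dim)
    ring = [sign + DIMS[(i + 1) % 4], sign + DIMS[(i - 1) % 4]]
    partner = ("-" if sign == "+" else "+") + dim
    return ring + [partner]
```

Under these assumptions a message arriving on the +W port reaches the +X or +Z port in one internal hop
and the -W port through the partner link.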

The PMEs are self-contained stored program microcomputers comprising a main store, local store,
operation decode, arithmetic/logic unit (ALU), working registers and Input/Output (I/O) ports. The PMEs have
the capability of fetching and executing stored instructions from their own main store in MIMD operation or
to fetch and execute commands via the BCI interface in SIMD mode. This interface permits
intercommunication among the controller, the PME, and other PMEs in a system made up of multiple chips.

The BCI is the node's interface to the external array controller element and to an array director. The
BCI provides common node functions such as timers and clocks. The BCI provides broadcast function
masking for each nodal PME and provides the physical interface and buffering for the broadcast-bus-to-PME
data transfers, and also provides the nodal interface to system status and monitoring and debug
elements.

Each PME contains separate interrupt levels to support each of its point-to-point interfaces and the
broadcast interface. Data is input to the PME main store or output from PME main store under Direct
Memory Access (DMA) control. An "input transfer complete" interrupt is available for each of the interfaces
to signal the PME software that data is present. Status information is available for the software to determine
the completion of data output operations.

Each PME has a "circuit switched mode" of I/O in which one of its four input ports can be switched
directly to one of its four output ports, without having the data enter the PME main store. Selection of the
source and destination of the "circuit switch" is under control of the software executing on the PME. The
other three input ports continue to have access to PME main store functions, while the fourth input is
switched to an output port.
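
The behavior of the circuit switched mode can be modeled in a few lines. The class and method names below
are hypothetical, ports are simply numbered 0 to 3, and the main store is reduced to a list of received
words; the sketch shows only the routing decision, not DMA or interrupts.

```python
class PmeIo:
    """Toy model of one PME's four input and four output ports."""

    def __init__(self):
        self.main_store = []                      # words delivered to memory
        self.outputs = {p: [] for p in range(4)}  # words sent on each output port
        self.circuit = None                       # (in_port, out_port) or None

    def set_circuit(self, in_port, out_port):
        """Software on the PME selects the source and destination of the switch."""
        if in_port not in range(4) or out_port not in range(4):
            raise ValueError("ports are numbered 0..3")
        self.circuit = (in_port, out_port)

    def receive(self, in_port, word):
        """Route an arriving word: switched input bypasses the main store."""
        if self.circuit is not None and self.circuit[0] == in_port:
            self.outputs[self.circuit[1]].append(word)
        else:
            self.main_store.append((in_port, word))
```

After `set_circuit(3, 1)`, a word arriving on input 3 appears on output 1 and never enters `main_store`,
while words on the other three inputs are stored as usual.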

An additional type of I/O has data that must be broadcast to, or gathered from, all PMEs, plus data
which is too specialized to fit on the standard buses. Broadcast data can include SIMD commands, MIMD
programs, and SIMD data. Gathered data is primarily status and monitor functions. Diagnostic and test
functions are the specialized data elements. Each node, in addition to the included set of PMEs, contains
one BCI. During operations the BCI section monitors the broadcast interface and steers/collects broadcast
data to/from the addressed PME(s). A combination of enabling masks and addressing tags are used by the
BCI to determine what broadcast information is intended for which PMEs.

Each PME is capable of operating in SIMD or in MIMD mode in our preferred embodiment. In SIMD
mode, each instruction is fed into the PME from the broadcast bus via the BCI. The BCI buffers each
broadcast data word until all of its selected nodal PMEs have used it. This synchronization provides
accommodation of the data timing dependencies associated with the execution of SIMD commands and
allows asynchronous operations to be performed by a PME. In MIMD mode, each PME executes its own
program from its own main store. The PMEs are initialized to the SIMD mode. For MIMD operations, the
external controller normally broadcasts the program to each of the PMEs while they are in SIMD mode, and
then commands the PMEs to switch to MIMD mode and begin executing. Masking/tagging the broadcast
information allows different sets of PMEs to contain different MIMD programs, and/or selected sets of PMEs
to operate in MIMD mode while other sets of PMEs execute in SIMD mode. In various software clusters or
partitions these separate functions can operate independently of the actions in other clusters or partitions.
The operation of the Instruction Set Architecture (ISA) of the PME will vary slightly depending on whether
the PME is in the SIMD or MIMD mode. Most ISA instructions operate identically regardless of mode.
However, since the PME in SIMD mode does not perform branching or other control functions, some code
points dedicated to those MIMD instructions are reinterpreted in SIMD mode to allow the PME to perform
special operations such as searching main memory for a match to a broadcast data value or switching to
MIMD mode. This further extends system flexibility of an array.
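
The load-then-switch sequence described above can be sketched as a small model. Everything named here is
hypothetical: the `LOAD` and `GO_MIMD` command names, the one-bit-per-PME mask encoding, and the
simplification that a PME already in MIMD mode ignores broadcast commands are assumptions made for the
example, not details taken from the BCI design.

```python
class Pme:
    def __init__(self):
        self.mode = "SIMD"     # PMEs are initialized to the SIMD mode
        self.main_store = []


class Bci:
    """Toy broadcast and control interface for the PMEs of one node."""

    def __init__(self, pmes):
        self.pmes = pmes

    def broadcast(self, mask, command, payload=None):
        """Deliver a command to every PME whose mask bit is set."""
        for i, pme in enumerate(self.pmes):
            if not (mask >> i) & 1:
                continue               # masked off: not addressed
            if pme.mode != "SIMD":
                continue               # assumed: MIMD PMEs run their own program
            if command == "LOAD":
                pme.main_store = list(payload)
            elif command == "GO_MIMD":
                pme.mode = "MIMD"
            else:
                raise ValueError("unknown command: " + command)
```

Broadcasting `LOAD` and then `GO_MIMD` with mask `0b00001111` leaves PMEs 0 to 3 running their own copy of
the program in MIMD mode while PMEs 4 to 7 remain in SIMD mode, which is the mixed operation the text
describes.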

PME Architecture

Basically, our preferred architecture comprises a PME which has a 16 bit wide data flow, 32K of 16 bit
memory, specialized I/O ports and I/O switching paths, plus the necessary control logic to permit each PME
to fetch, decode and execute the 16 bit instruction set provided by our instruction set architecture (ISA).
The preferred PME performs the functions of a virtual router, and thus performs both the processing
functions and data router functions. The memory organization allows, by cross addressing of memory
between PMEs, access to a large random access memory, and direct memory for the PME. The individual
PME memory can be all local, or divided into local and shared global areas programmatically. Specialized
controls and capabilities which we describe permit rapid task switching and retention of program state
information at each of the PME's interrupt execution levels. Although some of the capabilities we provide
have existed in other processors, their application for management of interprocessor I/O is unique in
massively parallel machines. An example is the integration of the message router function into the PME itself.
This eliminates specialized router chips or development of specialized VLSI routers. We also recognize that
in some instances one could distribute the functions we provide on a single chip onto several chips
interconnected by metalization layers or otherwise and accomplish improvements to massively parallel
machines. Further, as our architecture is scalable from a single node to massively parallel supercomputer
level machines, it is possible to utilize some of our concepts at different levels. As we will illustrate, for
example, our PME data flow is very powerful, and yet operates to make the scalable design effective.

The PME processing memory element develops, for each of the multiple PMEs of a node, a fully
distributed architecture. Every PME will be comprised of processing capability with 16 bit data flow, 64K
bytes of local storage, store and forward/circuit switch logic, PME to PME communication, SIMD/MIMD
switching capabilities, programmable routing, and dedicated floating point assist logic. These functions can
be independently operated by the PME and integrated with other PMEs within the same chip to minimize
chip crossing delays. Referring to FIGURES 7 and 8 we illustrate the PME dataflow. The PME consists of
16 bit wide dataflow 425, 435, 445, 455, 465, 32K by 16 bit memory 420, specialized I/O ports 400, 410,
480, 490 and I/O switching paths 425, plus the necessary control logic to permit the PME to fetch, decode
and execute a 16 bit reduced instruction set 430, 440, 450, 480. The special logic also permits the PME to
perform as both the processing unit 460 and data router. Specialized controls 405, 406, 407, 408 and
capabilities are incorporated to permit rapid task switching and retention of program state information at
each of the PME's interrupt execution levels. Such capabilities have been included in other processors;
however, their application specifically for management of interprocessor I/O is unique in massively parallel
machines. Specifically, it permits the integration of the router function into the PME without requiring
specialized chips or VLSI development macros.

16 bit internal data flow and control

The major parts of the internal data flow of the processing element are shown in FIGURE 7.
This processing element has a full 16 bit internal data flow 425, 435, 445, 455, 465. The important
paths of the internal data flow use 12 nanosecond hard registers such as the OP register 450, M register
440, WR register 470, and the program counter PC register 430. These registers feed the fully distributed
ALU 460 and I/O router registers and logic 405, 406, 407, 408 for all operations. With current VLSI
technology, the processor can execute memory operations and instruction steps at 25 MHz, and it can build
the important elements, OP register 450, M register 440, WR register 470, and the PC register 430, with
12 nanosecond hard registers. Other required registers are mapped to memory locations.

As seen in FIGURE 9 the internal data flow of the PME has its 32K by 16 bit main store in the form of
two DRAM macros. The remainder of the data flow consists of CMOS gate array macros. All of the memory
can be formed with the logic with low power CMOS DRAM deposition techniques to form a very large
scale integrated PME chip node. The PME is replicated 8 times in the preferred embodiment of the node
chip. The PME data flow consists of a 16 word by 16 bit general register stack, a multifunction
arithmetic/logic unit (ALU), working registers to buffer memory addresses, memory output registers, ALU
output registers, operation/command, I/O output registers, and multiplexors to select inputs to the ALU and
registers. Current CMOS VLSI technology for 4 MByte DRAM memory with our logic permits a PME to
execute instruction steps at 25 MHz. We are providing the OP register, the M register, the WR register and
the general register stack with 12 nanosecond hard registers. Other required registers are mapped to
memory locations within a PME.




The PME data flow is designed as a 16 bit integer arithmetic processor. Special multiplexor paths have
been added to optimize subroutine emulation of n x 16 bit floating point operations (n >= 1). The 16 bit data
flow permits effective emulation of floating point operations. Specific paths within the data flow have been
included to permit floating point operations in as little as 10 cycles. The ISA includes special code points to
permit subroutines for extended (longer than 16-bit) operand operations. The subsequent floating point
performance is approximately one twentieth the fixed point performance. This performance is
adequate to eliminate the need for special floating point chips augmenting the PME as is characteristic of
other massively parallel machines. Some other processors do include the special floating point processors
on the same chip as a single processor (see FIGURE 1). We can enable special floating point hardware
processors on the same chip with our PMEs but we would then need more cells than are required for the
preferred embodiment. For floating point operations, see also the concurrently filed FLOATING POINT
application referenced above for improvements to the IEEE standard.

The approach developed is well poised to take advantage of the normal increases in VLSI technology
performance. As circuit size shrinks and greater packaging density becomes possible, data flow
elements like base and index registers, currently mapped to memory, could be moved to hardware.
Likewise, floating point sub-steps are accelerated with additional hardware which we will prefer for the
developing CMOS DRAM technology as reliable higher density levels are achieved. Very importantly, this
hardware alternative does not affect any software.

The PME is initialized to SIMD mode with interrupts disabled. Commands are fed into the PME
operation decode buffer from the BCI. Each time an instruction operation completes, the PME requests a
new command from the BCI. In a similar manner, immediate data is requested from the BCI at the
appropriate point in the instruction execution cycle. Most instructions of the ISA operate identically whether
the PME is in SIMD mode or in MIMD mode, with the exception that SIMD instructions and immediate
data are taken from the BCI; in MIMD mode the PME maintains a program counter (PC) and uses that as
the address within its own memory to fetch a 16 bit instruction. Instructions such as "Branch" which
explicitly address the program counter have no meaning in SIMD mode and some of those code points are
reinterpreted to perform special SIMD functions such as comparing immediate data against an area of main store.

The PME instruction decode logic permits either SIMD or MIMD operational modes, and PMEs can
transition between modes dynamically. In SIMD mode the PME receives decoded instruction information
and executes that data in the next clock cycle. In MIMD mode the PME maintains a program counter PC
address and uses that as the address within its own memory to fetch a 16 bit instruction. Instruction decode
and execution proceeds as in most any other RISC type machine. A PME in SIMD mode enters MIMD
mode when given the information that represents a decode branch. A PME in MIMD mode enters the SIMD
mode upon executing a specific instruction for the transition.

When PMEs transition dynamically between SIMD and MIMD modes, MIMD mode is entered by
execution of a SIMD "write control register" instruction with the appropriate control bit set to a "1". At the
completion of the SIMD instruction, the PME enters the MIMD mode, enables interrupts, and begins
fetching and executing its MIMD instructions from the main store location specified by its general register
R0. Interrupts are masked or unmasked depending on the state of interrupt masks when the MIMD control
bit is set. The PME returns to SIMD mode either by being externally reinitialized or by executing a MIMD
"write control register" instruction with the appropriate control bit set to zero.

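The mode transitions described above can be sketched as a small state model. This is only an illustration: the class name `PmeModeModel`, its fields, and the method names are hypothetical and do not come from the PME instruction set, and the text does not say what happens to the interrupt enable when a PME returns to SIMD mode through the control register, so the sketch leaves it unchanged in that case.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SIMD = "SIMD"
    MIMD = "MIMD"


@dataclass
class PmeModeModel:
    """Toy model of SIMD/MIMD mode control (names are illustrative assumptions)."""
    mode: Mode = Mode.SIMD            # the PME is initialized to SIMD mode
    interrupts_enabled: bool = False  # ... with interrupts disabled
    pc: int = 0                       # program counter, used only in MIMD mode
    r0: int = 0                       # general register R0

    def write_control_register(self, mimd_bit: int) -> None:
        """Model the 'write control register' instruction's MIMD control bit."""
        if self.mode is Mode.SIMD and mimd_bit == 1:
            # Enter MIMD: enable interrupts, start fetching at the address in R0.
            self.mode = Mode.MIMD
            self.interrupts_enabled = True
            self.pc = self.r0
        elif self.mode is Mode.MIMD and mimd_bit == 0:
            # Return to SIMD; interrupt state is left as-is (not specified in the text).
            self.mode = Mode.SIMD

    def reinitialize(self) -> None:
        """Model external reinitialization: back to SIMD with interrupts disabled."""
        self.mode = Mode.SIMD
        self.interrupts_enabled = False
```

A PME created with `r0=0x0200` and then given `write_control_register(1)` ends up in MIMD mode with its program counter at `0x0200`.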
Data communication paths and control

Returning to FIGURE 7 it will be seen that each PME has 3 input ports 400, and 3 output ports 480
intended for on-chip communication plus 1 I/O port 410, 480 for off chip communications. Existing
technology, rather than the processor idea, requires that the off-chip port be byte wide half duplex. Input
ports are connected such that data may be routed from input to memory, or from input AR register 406 to
output register 408 via direct 16 bit data path 425. Memory would be the data sink for messages targeted at
the PME or for messages that were moved in 'store and forward' mode. Messages that do not target the
particular PME are sent directly to the required output port, providing a 'circuit switched' mode, when
blocking has not occurred. The PME S/W is charged with performing the routing and determining the
selected transmission mode. This makes dynamically selecting between 'circuit switched' and 'store and
forward' modes possible. This is also another unique characteristic of the PME design.

Thus, our preferred node has 8 PMEs and each PME has 4 output ports (Left, Right, Vertical, and
External). Three of the input ports and three of the output ports are 16-bit wide full duplex point-to-point
connections to the other PMEs on the chip. The fourth ports are combined in the preferred embodiment to
provide a half duplex point-to-point connection to an off-chip PME. Due to pin and power constraints that we
have imposed to make use of the less dense CMOS we employ, the actual off-chip interface is a byte-wide
path which is used to multiplex two halves of the inter-PME data word. With special "zipper" circuitry which
provides a dynamic, temporary logical breaking of internodal rings to allow data to enter or leave an array,
these external PME ports provide the APAP external I/O array function.

For data routed to the PME memory, normal DMA is supported such that the PME instruction stream
must become involved in the I/O processing only at the beginning and end of messages. Finally, data that is
being circuit switched to an internal output port is forwarded without clocking. This permits single cycle
data transfers within a chip and detects when chip crossings will occur such that the fastest but still reliable
communication can occur. Fast forwarding utilizes forward data paths and backward control paths, all
operating in transparent mode. In essence, a source looks through several stages to see the acknowledgments
from the PME performing a DMA or off-chip transfer.

As seen by FIGURES 7 and 8, data on a PME input port may be destined for the local PME, or for a
PME further down the ring. Data destined for a PME further down the ring may be stored in the local PME
main memory and then forwarded by the local PME towards the target PME (store and forward), or the local
input port may be logically connected to a particular local output port (circuit switched) such that the data
passes "transparently" through the local PME on its way to the target PME. Local PME software
dynamically controls whether or not the local PME is in "store and forward" mode or in "circuit switched"
mode for any of the four inputs and four outputs. In circuit switched mode, the PME concurrently processes
all functions except the I/O associated with the circuit switch; in store and forward mode the PME suspends
all other processing functions to begin the I/O forwarding process.

While data may be stored external to the array in a shared memory or DASD (with external controller),
it may be stored anywhere in the memories provided by PMEs. Input data destined for the local PME or
buffered in the local PME during "store and forward" operations is placed into local PME main memory via
a direct memory access (address) mechanism associated with each of the input ports. A program interrupt
is available to indicate that a message has been loaded into PME main memory. The local PME program
interprets header data to determine if the data destined for the local PME is a control message which can
be used to set up a circuit-switched path to another PME, or whether it is a message to be forwarded to
another PME. Circuit switched paths are controlled by local PME software. A circuit switched path logically
couples a PME input path directly to an output path without passing through any intervening buffer storage.
Since the output paths between PMEs on the same chip have no intervening buffer storage, data can enter
the chip, pass through a number of PMEs on the chip and be loaded into a target PME's main memory in a
single clock cycle! Only when a circuit switch combination leaves the chip is an intermediate buffer storage
required. This reduces the effective diameter of the APAP array by the number of unbuffered circuit switched
paths. As a result data can be sent from a PME to a target PME in as few clock cycles as there are
intervening chips, regardless of the number of PMEs in the path. This kind of routing can be compared to a
switched environment in which at each node cycles are required to carry data on to the next node. Each of
our nodes has 8 PMEs!
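The cycle-count argument can be illustrated with a deliberately simplified model. The sketch assumes one clock cycle per chip traversed on a circuit switched path (buffering only at chip crossings) and, for comparison, at least one cycle per PME-to-PME hop in a network that buffers at every node. The function names and the path representation (a list giving the chip number of each PME along the route) are hypothetical, and real timing would depend on details not given here.

```python
def circuit_switched_cycles(path_chips):
    """Cycles for a circuit switched path, assuming one cycle per chip traversed.

    path_chips lists the chip id of every PME on the route, source first.
    """
    if not path_chips:
        raise ValueError("path must contain at least one PME")
    # A buffer (and hence a new cycle) is needed only where the path leaves a chip.
    crossings = sum(1 for a, b in zip(path_chips, path_chips[1:]) if a != b)
    return crossings + 1


def per_hop_cycles(path_chips):
    """Lower bound for a network that buffers at every node: one cycle per hop."""
    if not path_chips:
        raise ValueError("path must contain at least one PME")
    return len(path_chips) - 1


# Eight PMEs on the route, spread over three chips.
route = [0, 0, 0, 1, 1, 2, 2, 2]
print(circuit_switched_cycles(route), per_hop_cycles(route))
```

Under these assumptions the route above costs 3 cycles when circuit switched, against at least 7 when every hop is buffered.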

Memory and Interrupt Levels 

The PME contains 32K by 16 bit 420 dedicated storage words. This storage is completely general and
can contain both data and program. In SIMD operations all of memory could be data as is characteristic of
other SIMD massively parallel machines. In MIMD modes, the memory is quite normal; but unlike most
massively parallel MIMD machines the memory is on the same chip with the PME and is thus immediately
available. This then eliminates the need for caching and cache coherency techniques characteristic of
other massively parallel MIMD machines. In the case, for instance, of Inmos's chip, only 4K resides on the
chip, and an external memory interface bus and pins are required. These are eliminated by us.

Low order storage locations are used to provide a set of general purpose registers for each interrupt
level. The particular ISA developed for the PME uses short address fields for these register references.
Interrupts are utilized to manage processing, I/O activities and S/W specified functions (i.e., a PME in
normal processing will switch to an interrupt level when incoming I/O initiates). If the level is not masked,
the switch is made by changing a pointer in H/W such that registers are accessed from a new section of
low order memory and by swapping a single PC value. This technique permits fast level switching and
permits S/W to avoid the normal register save operations and also to save status within the interrupt level
registers.

The PME processor operates on one of eight program interrupt levels. Memory addressing permits a
partitioning of the lower 576 words of memory among the eight levels of interrupts. 64 of these 576 words
of memory are directly addressable by programs executing on any of the eight levels. The other 512 words
are partitioned into eight 64 word segments. Each 64 word segment is directly accessible only by programs
executing on its associated interrupt level. Indirect addressing techniques are employed for allowing all
programs to access all 32K words of PME memory.
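The partitioning arithmetic (64 shared words plus eight private 64 word segments, 576 words in total) can be sketched as follows. The ordering of the regions, with the shared words first and the level segments following in level order, is an assumption made only for illustration, as are the function names; the text does not state the actual address layout.

```python
SHARED_WORDS = 64      # directly addressable from every interrupt level
LEVELS = 8             # number of program interrupt levels
SEGMENT_WORDS = 64     # private words per interrupt level
LOW_WORDS = SHARED_WORDS + LEVELS * SEGMENT_WORDS  # 64 + 8 * 64 = 576


def level_segment(level):
    """Address range of one level's private segment (assumed layout: shared first)."""
    if not 0 <= level < LEVELS:
        raise ValueError("interrupt level must be in 0..7")
    start = SHARED_WORDS + level * SEGMENT_WORDS
    return range(start, start + SEGMENT_WORDS)


def directly_addressable(level, address):
    """True if a program on `level` can reach `address` without indirect addressing."""
    return 0 <= address < SHARED_WORDS or address in level_segment(level)
```

With this layout, a program on level 3 can directly reach word 10 (shared) and its own segment, but must use indirect addressing to reach the first word of level 4's segment.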

The interrupt levels are assigned to the input ports, the BCI, and to error handling. There is a "normal"
level but there is no "privileged" nor "supervisor" level. A program interrupt causes a context switch in
which the contents of the PC program counter, status/control register, and selected general registers are
stored in specified main memory locations and new values for these registers are fetched from other
specified main memory locations.

The PME data flow discussed with reference to FIGURES 7 and 8 may be amplified by reference to
the additional sections below. In a complex system, the PME data flow uses the combination of the chip as
an array node with memory, processor and I/O which sends and receives messages with the BCI that we
replicate as the basic building block of an MMP built with our APAP. The MMP can handle many word
lengths.

PME Multiple Length Data Flow Processing

The system we describe can perform the operations handled by current processors with the data flow in
the PME which is 16 bits wide. This is done by performing operations on data lengths which are multiples
of 16 bits. This is accomplished by doing the operation in 16 bit pieces. One may need to know the result
of each piece (i.e. was it zero, was there a carry out of the high bit of the sum).

Adding two numbers of 48 bits can be an example of data flow. In this example two numbers of 48 bits
(a(0-47) and b(0-47)) are added by performing the following in the hardware:

a(32-47) + b(32-47) -> ans(32-47)    step one

1) save the carry out of high bit of sum
2) remember if partial result was zero or non-zero

a(16-31) + b(16-31) + saved carry -> ans(16-31)    step two

1) save the carry out of high bit of sum
2) remember if partial result was zero or non-zero from this result and from previous step; if both are
zero remember zero; if either is non-zero remember non-zero

a(0-15) + b(0-15) + saved carry -> ans(0-15)    final step

1) if this piece is zero and last piece was zero, ans is zero
2) if this piece is zero and last piece was non-zero, ans is non-zero
3) if this piece is non-zero, ans is positive or negative based on sign of sum (assuming no overflow)
4) if carry into sign of ans is not equal to carry out of sign of ans, ans has wrong sign and result is
an overflow (cannot properly represent in the available bits)

The length can be extended by repeating the second step in the middle as many times as required. If the
length were 32 the second step would not be performed. If the length were greater than 48, step two would
be done multiple times. If the length were just 16 the operation in step one, with conditions 3 and 4 of the
final step, would be done. Extending the length of the operands to multiple lengths of the data flow is a
technique having a consequence that the instruction usually takes longer to execute for a narrower data
flow. That is, a 32 bit add on a 32 bit data flow only takes one pass through the adder logic, while the same
add on a 16 bit data flow takes two passes through the adder logic.
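The step one / step two / final step procedure can be sketched in software as a loop over 16 bit pieces, from the least significant piece to the most significant. This is an illustrative model, not the PME hardware or microcode: the function name `add_multiword`, the flag names, and the choice to return a dictionary of flags are assumptions. Words are listed most significant first, matching the text's bit numbering in which a(0-15) is the high piece.

```python
WORD_BITS = 16
WORD_MASK = 0xFFFF


def add_multiword(a_words, b_words):
    """Add two equal-length operands held as 16 bit words, most significant first.

    Returns (result_words, flags), where flags carries the zero, negative,
    overflow and final carry indications described in the text.
    """
    if len(a_words) != len(b_words) or not a_words:
        raise ValueError("operands must have the same non-zero word count")
    n = len(a_words)
    result = [0] * n
    carry = 0            # saved carry out of the previous piece
    all_zero = True      # accumulated zero / non-zero memory across pieces
    carry_into_sign = 0
    for i in range(n - 1, -1, -1):          # low piece first, sign piece last
        a, b = a_words[i] & WORD_MASK, b_words[i] & WORD_MASK
        if i == 0:
            # Carry into the sign bit is the carry out of the low 15 bits.
            carry_into_sign = ((a & 0x7FFF) + (b & 0x7FFF) + carry) >> 15
        total = a + b + carry
        result[i] = total & WORD_MASK
        carry = total >> WORD_BITS          # save the carry out of the high bit
        all_zero = all_zero and result[i] == 0
    flags = {
        "zero": all_zero,
        "negative": bool(result[0] & 0x8000),
        "overflow": carry_into_sign != carry,   # condition 4 of the final step
        "carry": carry,
    }
    return result, flags


# 48 bit example: 0x0000_FFFF_FFFF + 1 needs all three steps to ripple the carry.
print(add_multiword([0x0000, 0xFFFF, 0xFFFF], [0x0000, 0x0000, 0x0001]))
```

A one word operand runs only the final iteration, and each additional word adds one more pass through the loop, which mirrors the observation that a narrower data flow needs more passes through the adder.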

What we have done that is interesting is that in the current implementation of the machine we have
single instructions which can perform adds/subtracts/compares/moves on operands of length 1 to 8 words
(length is defined as part of the instruction). Individual instructions available to the programmer perform the
same kind of operations as shown above for steps one, two, and final (except that to the programmer the
operand length is longer, i.e. 16 to 128 bits). At the bare bones hardware level, we are working on 16 bits at
a time, but the programmer thinks s/he's doing 16 to 128 bits at a time.
By using combinations of these instructions, operands of any length can be manipulated by the
programmer, i.e. two instructions can be used to add two numbers of up to 256 bits in length.

PME Processor 

Our PME processor is different from modern microprocessors currently utilized for MPP applications.
The processor portion differences include:

1. the processor is a fully programmable hardwired computer (see the ISA description for an instruction
set overview) with:
   o a complete memory module shown in the upper right corner (see FIGURE 8),
   o hardware registers with controls required to emulate separate register sets for each interrupt
     level (shown in upper left corner),
   o an ALU with the required registers and controls to permit effective multi-cycle integer and floating
     point arithmetic,
   o I/O switching paths needed to support packet or circuit switched data movement between
     PMEs interconnected by point-to-point links shown in the lower right corner,
2. This is our minimalist approach to processor design permitting eight replications of the PME per chip
with the CMOS DRAM technology,
3. This processor portion of the PME provides about the minimum dataflow width required to encode a
fast Instruction Set Architecture (ISA), see Tables, which is required to permit effective MIMD or SIMD
operation of our MMP.

PME Resident Software 

The PME is the smallest element of the APAP capable of executing a stored program. It can execute a
program which is resident in some external control element and fed to it by the broadcast and control
interface (BCI) in SIMD mode or it can execute a program which is resident in its own main memory (MIMD
mode). It can dynamically switch between SIMD mode and MIMD mode representing SIMD/MIMD mode
duality functions, and also the system can execute these dualities at the same time (SIMIMD mode). A
particular PME can make this dynamic switch by merely setting or resetting a bit in a control register. Since
SIMD PME software is actually resident in the external control element, further discussion of this may be
found in our discussion of the Array Director and in related applications.

MIMD software is stored into the PME main memory while the PME is in SIMD mode. This is feasible
since many of the PMEs will contain identical programs because they will be processing similar data in an
asynchronous manner. Here we would note that these programs are not fixed, but they can be modified by
loading the MIMD program from an external source during processing of other operations.

Since the PME instruction set architecture represented in the Tables is that of a microcomputer, there
are few restrictions with this architecture on the functions which the PME can execute. Essentially each
PME can function like a RISC microprocessor. Typical MIMD PME software routines are listed below:
1. Basic control program for dispatching and prioritizing the various resident routines.
2. Communication software to pass data and control messages among the PMEs. This software would
determine when a particular PME would go into/out of the "circuit switched" mode. It performs a "store
and forward" function as appropriate. It also initiates, sends, receives, and terminates messages between
its own main memory and that of another PME.
3. Interrupt handling software completes the context switch, and responds to an event which has caused
the interrupt. These can include fail-safe routines and rerouting or reassignment of PMEs to an array.
4. Routines which implement the extended Instruction Set Architecture which we describe below. These
routines perform macro level instructions such as extended precision fixed point arithmetic, floating point
arithmetic, vector arithmetic, and the like. This permits not only complex math to be handled but image
processing activities for display of image data in multiple dimensions (2d and 3d images) and multimedia
processes.
5. Standard mathematical library functions can be included. These can preferably include LINPACK and
VPSS routines. Since each PME may be operating on a different element of a vector or matrix, the
various PMEs may all be executing different routines or differing portions of the same matrix at one time.
6. Specialized routines for performing scatter/gather or sorting functions which take advantage of the
APAP nodal interconnection structure and permit dynamic multi-dimensional routing are provided. The
routines effectively take advantage of some amount of synchronization provided among the various
PMEs, while permitting asynchronous operations to continue. For sorts, there are sort routines. The
APAP is well suited to a Batcher Sort, because that sort requires extensive calculations to determine
the particular elements to compare versus very short comparison cycles. Program synchronization is
managed by the I/O statements. The program allows multiple data elements per PME and very large parallel
sorts in quite a straightforward manner.
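To make the remark about the Batcher Sort concrete, the following is a sequential sketch of one of Batcher's sorting networks, the bitonic sort. It is not taken from the APAP software: the text does not say which of Batcher's networks is meant, and the function name `bitonic_sort` and the one-element-per-position mapping are illustrative assumptions. The point of interest is that each compare-exchange partner is found by flipping a single address bit (`i ^ j`), which is a neighbor along one dimension of a binary hypercube.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence with Batcher's bitonic network."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance to the compare-exchange partner
            for i in range(n):
                partner = i ^ j   # differs from i in exactly one address bit
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2]))
```

In a parallel setting every iteration of the inner `for` loop is independent, so all compare-exchange pairs of one stage could proceed at the same time; the sequential loop here only models the order of the stages.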
While each PME has its own resident software, the systems made from these microcomputers can
execute higher level language processes designed for scalar and parallel machines. Thus the system can
execute application programs which have been written for UNIX machines, or those of other operating
systems, in high level languages such as Fortran, C, C++, FortranD, and so on.

It may be an interesting footnote that our processor concepts use an approach to processor design
which is quite old. Perhaps thirty years of use of a similar ISA design has occurred in IBM's military
processors. We have been the first to recognize that this kind of design can be used to advantage to
leapfrog the dead ended current modern microprocessor design when combined with our total PME design
to move the technology to a new path for use in the next century.

Although the processor's design characteristics are quite different from other modern microprocessors,
similar gate constrained military and aerospace processors have used the design since the '60s. It provides
sufficient instructions and registers for straightforward compiler development and both general and signal
processing applications are effectively running on this design. Our design has minimal gate requirements,
and IBM has implemented some similar concepts for years when embedded chip designs were needed for
general purpose processing. Our adoption now of parts of the older ISA design permits use of many utilities
and other software vehicles which will enable adoption of our systems at a rapid rate because of the
existing base and the knowledge that many programmers have of the design concepts.

PME I/O 

The PME will interface to the broadcast and control interface (BCI) bus by either reading data from the
bus into the ALU via the path labeled BCI in FIGURE 8 or by fetching instructions from the bus directly into
the decode logic (not shown). The PME powers up in SIMD mode and in that mode reads, decodes and
executes instructions until encountering a branch. A broadcast command in SIMD mode causes the transition
to MIMD with instructions fetched locally. A broadcast PME instruction 'INTERNAL DIOW' reverts the state.

PME I/O can be sending data, passing data or receiving data. When sending data, the PME sets the
CTL register to connect XMIT to either L, R, V, or X. H/W services then pass a block of data from memory
to the target via the ALU multiplexer and the XMIT register. This processing interleaves with normal
instruction operation. Depending upon application requirements, the block of data transmitted can contain
raw data for a predefined PME and/or commands to establish paths. A PME that receives data will store
input to memory and interrupt the active lower level processing. The interpretation task at the interrupt level
can use the interrupt event to do task synchronization or initiate a transparent I/O operation (when data is
addressed elsewhere). During the transparent I/O operation, the PME is free to continue execution. Its CTL
register makes it a bridge. Data will pass through it without gating, and it will remain in that mode until an
instruction or the data stream resets CTL. While a PME is passing data it cannot be a data source, but it
can be a data sink for another message.

PME Broadcast Section 

This is a chip-to-common control device interface. It can be used by the device that serves as a
controller to command I/O, or test and diagnose the complete chip.

Input is word sequences (either instruction or data) that are available to subsets of PMEs. Associated
with each word is a code indicating which PMEs will use the word. The BCI will use the word both to
limit a PME's access to the interface and to assure that all required PMEs receive data. This serves to
adjust the BCI to the asynchronous PME operations. (Even when in SIMD mode PMEs are asynchronous
due to I/O and interrupt processing.) The mechanism permits PMEs to be formed into groups which are
controlled by interleaved sets of command/data words received over the BCI.

Besides delivering data to the PMEs, the BCI accepts request codes from the PMEs, combines them,
and transmits the integrated request. This mechanism can be used in several ways. MIMD processes can
be initiated in a group of processors that all end with an output signal. The 'AND' of signals triggers the
controller to initiate a new process. Applications, in many cases, will not be able to load all required S/W in
PME memory. Encoded requests to the controller will be used to acquire a S/W overlay from perhaps the
host's storage system.
The controller uses a serial scan loop through many chips to acquire information on individual chips or
PMEs. These loops initially interface to the BCI but can in the BCI be bridged to individual PMEs.

Broadcast and Control Interface

The BCI broadcast and control interface provided on each chip provides a parallel input interface such
that data or instructions can be sent to the node. Incoming data is tagged with a subset identifier and the
BCI includes the functions required to assure that all PMEs within the node, operating within the subset, are
provided the data or instructions. The parallel interface of the BCI serves both as a port to permit data to be
broadcast to all PMEs and as the instruction interface during SIMD operations. Satisfying both requirements
plus extending those requirements to supporting subset operations is completely unique to this design
approach.

Our BCI parallel input interface permits data or instructions to be sent from a control element that is
external to the node. The BCI contains "group assignment" registers (see the grouping concepts in our
above application entitled GROUPING OF SIMD PICKETS) which are associated with each of the PMEs.
Incoming data words are tagged with a group identifier and the BCI includes the functions required to
assure that all PMEs within the node which are assigned to the dedicated group are provided the data or
instructions. The parallel interface of the BCI serves as both a port to permit data to be broadcast to the
PMEs during MIMD operations, and as the PME instruction/immediate operand interface during SIMD
operations.
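The group tagging just described can be sketched as a simple filter. This is an illustration of the idea only: the function name `bci_broadcast`, the representation of the group assignment registers as a list with one entry per PME, and the per-PME queues are assumptions, and the sketch ignores timing and flow control entirely.

```python
def bci_broadcast(stream, group_assignment):
    """Deliver tagged words to the PMEs whose group register matches the tag.

    stream is a sequence of (group_tag, word) pairs, possibly interleaving
    several groups; group_assignment[p] is the group register of PME p.
    Returns one list of received words per PME.
    """
    queues = [[] for _ in group_assignment]
    for tag, word in stream:
        for pme, group in enumerate(group_assignment):
            if group == tag:          # only PMEs assigned to this group take the word
                queues[pme].append(word)
    return queues


# Eight PMEs of one node split into three groups; two command streams interleaved.
assignment = [0, 0, 1, 1, 0, 1, 2, 2]
words = [(0, "A1"), (1, "B1"), (0, "A2")]
print(bci_broadcast(words, assignment))
```

PMEs in group 0 receive `A1` and `A2`, PMEs in group 1 receive only `B1`, and the PMEs in group 2 receive nothing from this stream.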

The BCI also provides two serial interfaces. The high speed serial port will provide each PME with the
capability to output a limited amount of status information. That data is intended to:

1. signal our Array Director 610 when the PME, e.g. 500, has data that needs to be read or that the PME
has completed some operation. It passes a message to the external control element represented by the
Array Director.
2. provide activity status such that external test and monitor elements can illustrate the status of the
entire system.

The standard serial port permits the external control element means for selectively accessing a specific
PME for monitor and control purposes. Data passed over this interface can direct data from the BCI parallel
interface to a particular PME register or can select data from a particular PME register and route it to the
high speed serial port. These control points allow the external control element to monitor and control
individual PMEs during initial power up and diagnostic phases. It permits the Array Director to input control
data so as to direct the port to particular PME and node internal registers and access points. These registers
provide paths such that PMEs of a node can output data to the Array Director, and these registers permit the
Array Director to input data to the units during initial power up and diagnostic phases. Data input to an access
point can be used to control test and diagnostic operations, i.e. perform single instruction step, stop on
compare, break points, etc.

Node Topology

Our modified hypercube topology connection is most useful for massively parallel systems, while other
less powerful connections can be used with our basic PMEs. Within our initial embodiment of the VLSI chip
are eight PMEs with fully distributed PME internal hardware connections. The internal PME to PME chip
configuration is two rings of four PMEs, with each PME also having one connection to a PME in the other
ring. For the case of eight PMEs in a VLSI chip this is a three dimensional binary hypercube; however, our
approach in general does not use hypercube organizations within the chip. Each PME also provides for the
escape of one bus. In the initial embodiment the escaped buses from one ring are called +X, +Y, +W and
+Z, while those from the other ring are labeled similarly except - (minus).
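The statement that two rings of four PMEs with one cross connection per PME form a three dimensional binary hypercube can be checked with a short sketch. The 3 bit addresses, the Gray code ordering around each ring, and the choice to cross-connect PMEs at the same ring position are labeling assumptions made for the check; the text does not give PME addresses.

```python
from itertools import combinations

# Ring order in which adjacent positions differ in exactly one address bit.
GRAY = [0b00, 0b01, 0b11, 0b10]


def node_edges():
    """Links inside one node: two 4-PME rings plus one cross link per PME."""
    edges = set()
    for ring in (0, 1):
        for pos in range(4):
            a = (ring << 2) | GRAY[pos]
            b = (ring << 2) | GRAY[(pos + 1) % 4]
            edges.add(frozenset((a, b)))          # link to the next PME in the ring
    for pos in range(4):
        # Cross link between same-position PMEs of the two rings (assumed pairing).
        edges.add(frozenset((GRAY[pos], 0b100 | GRAY[pos])))
    return edges


def hypercube_edges(dim):
    """Links of a binary hypercube: addresses that differ in exactly one bit."""
    return {
        frozenset((u, v))
        for u, v in combinations(range(1 << dim), 2)
        if bin(u ^ v).count("1") == 1
    }


print(node_edges() == hypercube_edges(3))
```

Each PME ends up with exactly three on-chip neighbors (two in its own ring, one in the other ring), which matches the three on-chip ports per PME described earlier.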

The specific chip organization is referred to as the node of the array, and a node can be in a cluster of
the array. The nodes are connected using +-X and +-Y into an array, to create a cluster. The
dimensionality of the array is arbitrary, and in general greater than two which is the condition required for
developing a binary hypercube. The clusters are then connected using +-W, +-Z into an array of clusters.
Again, the dimensionality of the array is arbitrary. The result is the 4-dimensional hypercube of nodes. The
extension to a 5-dimensional hypercube requires the usage of a 10 PME node and uses the additional two
buses, say +-E1, to connect the 4-dimensional hypercube into a vector of hypercubes. We have then shown
the pattern of extension to either odd or even radix hypercubes. This modified topology limits the cluster to
cluster wiring while supporting the advantages of the hypercube connection.

Our wireability and topology configuration for massively parallel machines has advantages in keeping
the X and Y dimensions within our cluster level of packaging, and in distributing the W and Z bus
connections to all the neighboring clusters. After implementing the techniques described, the product will be
wireable and manufacturable while maintaining the inherent characteristics of the topology defined.

The node consists of k*n PMEs plus the Broadcast and Control Interface (BCI) section. Here 'n'
represents the number of dimensions or rings which characterize the modified hypercube, while 'k'
represents the number of rings that characterize the node. Although a node can contain k rings, it is a
characteristic of the concept that only two of those rings may provide escape buses. 'n' and 'k' are limited
in our preferred embodiment, because of the physical chip package, to n = 4 and k = 2. This limitation is a
physical one, and different chip sets will allow other and increased dimensionality in the array. In addition
to being a part of the physical chip package, it is our preferred embodiment to provide a grouping of PMEs
that interconnect a set of rings in a modified hypercube. Each node will have 8 PMEs with their PME
architecture and ability to perform processing and data router functions. As such, n is the dimensionality of
the modified hypercube (see following section), i.e., a 4d modified hypercube's node element would be 8
PMEs while a 5d modified hypercube's node would be 10 PMEs. For visualization of nodes which we can
employ, refer to FIGURE 6, as well as FIGURES 9 and 10 for visualization of interconnections, and see
FIGURE 11 for a block diagram of each node. FIGURES 16 and 10 elaborate on possible interconnections
for an APAP.

It will be noted that the application entitled 'METHOD FOR INTERCONNECTING AND SYSTEM OF
INTERCONNECTED PROCESSING ELEMENTS' of co-inventor David B. Rolfe, filed in the United States
Patent and Trademark Office on May 13, 1991, under USSN 07/698,866, described the modified hypercube
criteria which can preferably be used in connection with our APAP MMP.

That application is incorporated by reference and describes a method of interconnecting processing
elements in such a way that the number of connections per element can be balanced against the network
diameter (worst case path length). This is done by creating a topology that maintains many of the well
known and desirable topological properties of hypercubes while improving its flexibility by enumerating the
nodes of the network in number systems whose base can be varied. When using a base 2 number system
this method creates the hypercube topology. The invention has fewer interconnections than a hypercube,
has uniform connections, and preserves the properties of a hypercube. These properties include: 1) large
number of alternate paths, 2) very high aggregate bandwidth, and 3) well understood and existing methods
that can be used to map other common problem topologies with the topology of the network. The result is a
generalized non-binary hypercube with less density. It will be understood that with the preference we have
given to the modified hypercube approach, in some applications a conventional hypercube can be utilized.
In connecting nodes, other approaches to a topology could be used; however, the ones we describe herein
are believed to be new and an advance, and we prefer the ones we describe.

The interconnection methods for the modified hypercube topology for interconnecting a plurality of
nodes in a network of PMEs:

1. defines a set of integers e1, e2, e3, ..., such that the product of all elements equals the number of PMEs
in the network, called M, while the product of all elements in the set excepting e1 and e2 is the number
of nodes, called N, and the number of elements in the set, called m, defines the dimensionality of the
network n by the relationship n = m-2.

2. addresses a PME located by a set of indexes a1, a2, ..., am, where each index is the PME's position in
the equivalent level of expansion, such that the index ai is in the range of zero to ei-1 for i equal to 1, 2,
..., m, by the formula

(...((a(m)*e(m-1) + a(m-1))*e(m-2) + ...) + a(2))*e(1) + a(1)

where the notation a(i) has the normal meaning of the ith element in the list of elements called a, or
equivalently for e.

3. connects two PMEs (with addresses r and s) if and only if either of the following two conditions hold:

a. the integer part of r/(e1*e2) equals the integer part of s/(e1*e2) and:

1. the remainder part of r/e1 differs from the remainder part of s/e1 by 1 or,

2. the remainder part of r/e2 differs from the remainder part of s/e2 by 1 or e2-1.

b. the remainder part of r/ei differs from the remainder part of s/ei for i in the range 3, 4, ..., m and the
remainder part of r/e1 equals the remainder part of s/e1 which equals i minus three, and the
remainder part of r/e2 differs from the remainder part of s/e2 by e2 minus one.

As a result the computing system nodes will form a non-binary hypercube, with the potential for being
different radix in each dimension. The node is defined as an array of PMEs which supports 2*n ports such
that the ports provided by nodes match the dimensionality requirements of the modified hypercube. If the
set of integers e3, e4, ..., em, which define the specific extent of each dimension of a particular modified
hypercube, are all taken as equal, say b, and if e1 and e2 are taken as 1, then the previous formulas for
addressability and connections reduce to:

2. addressing a PME as numbers representing the base b numbering system

3. connecting two computing elements (f and g) if and only if the address of f differs from the address of
g in exactly one base b digit, using the rule that 0 and b-1 are separated by 1.

4. the number of connections supported by each PME is 2*n

which is exactly as described in the base application, with the number of communication buses spanning
non-adjacent PMEs chosen as zero.
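
The reduced base b rule lends itself to a small worked sketch. The Python below is a minimal illustration
under one stated assumption: that "differs in exactly one base b digit" means the differing digit changes by
one step, with 0 and b-1 treated as adjacent. The function names are hypothetical.

```python
def to_digits(addr, b, n):
    """Return the n base-b digits of addr, least significant digit first."""
    digits = []
    for _ in range(n):
        digits.append(addr % b)
        addr //= b
    return digits


def connected(f, g, b, n):
    """True if f and g differ in exactly one base-b digit by one step.

    The step is taken modulo b, so digit values 0 and b-1 count as adjacent.
    """
    df, dg = to_digits(f, b, n), to_digits(g, b, n)
    differing = [i for i in range(n) if df[i] != dg[i]]
    if len(differing) != 1:
        return False
    i = differing[0]
    step = (df[i] - dg[i]) % b
    return step == 1 or step == b - 1


def neighbors(f, b, n):
    """All addresses connected to f in an n-dimensional base-b network."""
    return [g for g in range(b ** n) if connected(f, g, b, n)]
```

For b greater than 2 each address has 2*n neighbors. For b = 2 the "+1" and "-1" neighbors in a dimension
coincide, so the rule yields the n neighbors of an ordinary binary hypercube.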

Intra-Node PME Interconnections:

PMEs are configured within the node as a 2 by n array. Each PME is interconnected with three
neighbors (edges wrap only in the second dimension) using a set of input/output ports, thus providing full
duplex communication capability between PMEs. Each PME's external input and output port is connected to
node I/O pins. Input and output ports may be connected to share pins for half-duplex communication or to
separate pins for full-duplex capability. The interconnections for a 4d modified hypercube node are shown
in FIGURES 9 and 10. (Note that where n is even the node can be considered to be a 2 by 2 by n/2 array.)

FIGURE 9 illustrates the eight processing elements 500, 510, 520, 530, 540, 550, 560, 570 within
the node. The PMEs are connected in a binary hypercube communication network. This binary hypercube
displays three intra connections between PMEs (501, 511, 521, 531, 541, 551, 561, 571, 590, 591, 592, 593).
Communication between the PMEs is controlled by in and out registers under control of the processing
element. This diagram shows the various paths that can be taken to escape I/O out any of the eight
directions, +-w 525, 565, +-x 515, 555, +-y 505, 545, +-z 535, 575. The communication can be
accomplished without storing the data into memory if desired.

It may be noted that while a network switch chip could be employed to connect various cards each
having our chip with other chips of the system, it can and should desirably be eliminated. Our inter PME
network that we describe as the "4d torus" is the mechanism used for inter PME communication. A PME
can reach any other PME in the array on this interface. (PMEs in between may be Store & Forward or Circuit
Switched.)

Chip Relationships for Interconnections 

We have discussed the chip, and FIGURE 11 shows a block diagram of the PME Processor/Memory
chip. The chip consists of the following elements, each of which will be described in later paragraphs:

1. 8 PMEs, each consisting of a 16 bit programmable processor and 32K words of memory (64K bytes),

2. Broadcast Interface (BCI) to permit a controller to operate all or subsets of the PMEs and to
accumulate PME requests,

3. Interconnection Levels

a. Each PME supports four 8 bit wide inter-PME communication paths. These connect to 3
neighboring PMEs on the chip and 1 off chip PME.

b. Broadcast-to-PME busing, which makes data or instructions available.

c. Service Request lines that permit any PME to send a code to the controller. The BCI combines the
requests and forwards a summary.

d. Serial Service loops permit the controller to read all detail about the functional blocks. This level of
interconnection extends from the BCI to all PMEs (FIGURE 11 for ease of presentation omits this
detail.)

Interconnection and Routing

The MPP will be implemented by replication of the PME. The degree of replication does not affect the
interconnection and routing schemes used. FIGURE 6 provides an overview of the network interconnection
scheme. The chip contains 8 PMEs with interconnections to their immediate neighbors. This interconnection
pattern results in the three dimensional cube structure shown in FIGURE 10. Each of the processors within
the cube has a dedicated bidirectional byte port to the chip's pins; we refer to the set of 8 PMEs as a node.

An n by n array of nodes is a cluster. Simple bridging between the + and - x ports and the + and - y
ports provides the cluster node interconnections. Here our preferred chip or node has 8 PMEs, each of
which manages a single external port. This distributes the network control function and eliminates a possible
bottleneck for ports. Bridging the outer edges makes the cluster into a logical torus. We have considered
clusters with n=4 and n=8 and believe that n=8 is the better solution for commercial applications while
n=4 is better for military conduction cooled applications. Our concept does not impose an unchangeable
cluster size. On the contrary, we anticipate some applications using variations.

An array of clusters results in the 4 dimensional torus or hypercube structure illustrated in FIGURE 10.
Bridging between the + and - w ports and + and - z ports provides the 4d torus interconnections. This
results in each node within a cluster connected to an equivalent node in all adjacent clusters. (This provides
64 ports between two adjacent clusters rather than the 8 ports that would result from larger clusters.) As
with the cluster size, the scheme does not imply a particular size array. We have considered 2x1 arrays
desirable for workstations and MIL applications and 4x4, 4x8 and 8x8 arrays for mainframe applications.

Developing an array of 4d toruses is beyond the gate, pin, and connector limitations of our current
preferred chip. However, that limitation disappears when our alternative on-chip optical driver/receiver is
employed. In this embodiment our network could use an optical path per PME; with 12 rather than 8 PMEs
per chip, an array of 4d toruses with multi-Tflop (Teraflop) performance becomes possible, and it seems to be
economically feasible to make such machines available for the workstation environment. Remember that such
alternative machines will use the application programs developed for our current preferred embodiment.

4d Cluster Organization

For constructing a 4d modified hypercube 360, as illustrated in FIGURES 5 and 10, nodes supporting 8
external ports 315 are required. Consider the external ports to be labeled as +X, +Y, +Z, +W, -X, -Y, -Z,
-W. Then using m1 nodes, a ring can be constructed by connecting the +X to -X ports. Again, m2 such
rings can be interconnected into a ring of rings by interconnecting the matching +Y to -Y ports. This level
of structure will be called a cluster 320. With m1 = m2 = 8 it provides for 512 PMEs and such a cluster will
be a building block for several size systems (330, 340, 350), as illustrated with m=8 in FIGURE 8.

4d Array Organization

For building large fine-grained systems, sets of m3 clusters are interconnected in rows using the +Z to
-Z ports. The m4 rows are then interconnected using the +W to -W ports. For m1 = ... = m4 = 8 this results in a
system with 32768 PMEs. The organization does not require that every dimension be equally
populated, as shown in FIGURE 6 (large fine-grained parallel processor 370). In the case of the fine-grained
small processor, only a cluster might be used, with the unused Z and W ports being interconnected on the
card. This technique saves card connector pins and makes possible the application of this scalable
processor to workstations 340, 350 and avionics applications 330, both of which are connector pin limited.
Connecting +/- ports together in the Z and W pairs leads to a workstation organization that permits debug,
test and large machine software development.

Again, much smaller scale versions of the structure can be developed by generating the structure with a
value smaller than m = 8. This will permit construction of single card processors compatible with the
requirements for accelerators in the PS/2 or RISC System 6000 workstation 350.
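
The PME counts for a cluster and for a full array follow directly from the ring extents and the eight PMEs
per node. As a minimal sketch (the function name and the example extents other than 8 are assumptions
made for illustration):

```python
PMES_PER_NODE = 8  # preferred embodiment: eight PMEs on one chip (node)


def pme_count(extents):
    """Number of PMEs in an array of nodes with the given ring extents.

    extents lists the number of positions in each populated dimension,
    for example [m1, m2] for a cluster or [m1, m2, m3, m4] for a 4d array.
    """
    nodes = 1
    for m in extents:
        nodes *= m
    return nodes * PMES_PER_NODE


cluster_pmes = pme_count([8, 8])        # one 8 by 8 cluster of nodes
system_pmes = pme_count([8, 8, 8, 8])   # 8 by 8 array of such clusters
small_pmes = pme_count([4, 4])          # a smaller single cluster, m = 4
partial_pmes = pme_count([8, 8, 2, 1])  # dimensions need not be equally populated
```

With every extent equal to 8 this gives 512 PMEs per cluster and 32768 PMEs in the array.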

I/O Performance

I/O performance includes overhead to setup transfers and actual burst rate data movement. Setup
overhead depends upon application function I/O complexity and network contention. For example, an
application can program circuit switched traffic with buffering to resolve conflicts or it can have all PMEs
transmit left and synchronize. In the first case, I/O is a major task and detailed analysis would be used to
size it. We estimate that simple case setup overhead is 20 to 30 clock cycles or 0.8 to 1.2 u-sec.

Burst rate I/O is the maximum rate a PME can transfer data (with either an on or off chip neighbor).
Memory access limits set the data rate at 140 nsec per byte, corresponding to 7.14 Mbyte/s. This
performance includes buffer address and count processing plus data read/write. It uses seven 40 ns cycles
per 16 bit word transferred.

This burst rate performance corresponds to a cluster having a maximum potential transfer rate of 3.65
Gbytes/s. It also means that a set of eight nodes along a row or column of the cluster will achieve 57
Mbyte/s burst data rate using one set of their 8 available ports. This number is significant because I/O with
the external world will be done by logically 'unzipping' an edge of the wrapped cluster and attaching it to
the external system bus.
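
The figures in this section can be reproduced with a few lines of arithmetic. The sketch below assumes a
40 ns clock cycle, seven cycles per 16 bit word, and an 8 by 8 cluster of 8-PME nodes. The variable names
are chosen here for illustration.

```python
CYCLE_NS = 40         # PME clock cycle in nanoseconds
CYCLES_PER_WORD = 7   # cycles used per 16 bit word transferred
BYTES_PER_WORD = 2
PMES_PER_NODE = 8
NODES_PER_EDGE = 8    # assumed 8 by 8 cluster

# Burst rate of a single PME port.
ns_per_byte = CYCLE_NS * CYCLES_PER_WORD / BYTES_PER_WORD
pme_mbyte_s = 1000.0 / ns_per_byte

# Eight nodes along one row or column, one port each.
row_mbyte_s = NODES_PER_EDGE * pme_mbyte_s

# Every PME in the cluster transferring at once.
cluster_gbyte_s = NODES_PER_EDGE ** 2 * PMES_PER_NODE * pme_mbyte_s / 1000.0

# Simple case setup overhead of 20 to 30 clock cycles, in microseconds.
setup_us = tuple(cycles * CYCLE_NS / 1000.0 for cycles in (20, 30))

print(f"{ns_per_byte:.0f} ns per byte, {pme_mbyte_s:.2f} Mbyte/s per PME")
print(f"{row_mbyte_s:.1f} Mbyte/s per cluster edge")
print(f"{cluster_gbyte_s:.2f} Gbyte/s per cluster")
print(f"setup overhead {setup_us[0]:.1f} to {setup_us[1]:.1f} microseconds")
```

The per-edge figure is the one that matters for system bus attachment, since it bounds what a single
unwrapped cluster edge can deliver.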

Inter-PME Routing Protocol 

The SIMD/MIMD PME comprises interprocessor communication to the external I/O facilities, broadcast
control interfaces, and switching features which allow both SIMD/MIMD operation within the same PME.
Embedded in the PME is the fully distributed programmable I/O router for processor communication and
data transfers between PMEs.

The PMEs have fully distributed interprocessor communication hardware to on-chip PMEs as well as to
the external I/O facilities which connect to the interconnected PMEs in the modified hypercube
configuration. This hardware is complemented with the flexible programmability of the PME to control the
I/O activity via software. The programmable I/O router functions provide for generating data packets and
packet addresses. With this information the PME can send the information through the network of PMEs in
a directed method or out multiple paths determined by any fault tolerance requirements.

Distributed fault tolerance algorithms or program algorithms can take advantage of the programmability
along with the supported circuit switched modes of the PME. This performance combinational mode
enables everything from off-line PMEs to optimal path data structures to be accomplished via the
programmable I/O router.

Our study of applications reveals that it is sometimes most efficient to send bare data between PMEs.
At other times applications require data and routing information. Further, it is sometimes possible to plan
communications so that network conflicts cannot occur; other applications offer the potential for deadlock,
unless mechanisms for buffering messages at intermediate nodes are provided. Two examples illustrate the
extremes. In the relaxation phase of a PDE solution, each grid point can be allocated to a node. The inner
loop process of acquiring data from a neighbor can easily be synchronized over all nodes. Alternatively,
image transformations use local data parameters to determine communication target or source identifiers.
This results in data moves through multiple PMEs, and each PME becomes involved in the routing task for
each packet. Preplanning such traffic is generally not possible.

To enable the network to be efficient for all types of transfer requirements, we partition, between the
H/W and S/W, the responsibility for data routing between PMEs. S/W does most of the task sequencing
function. We added special features to the hardware (H/W) to do the inner loop transfers and minimize
software (S/W) overhead on the outer loops.

I/O programs at dedicated interrupt levels manage the network. For most applications, a PME dedicates
four interrupt levels to receiving data from the four neighbors. We open a buffer at each level by loading
registers at the level, and executing the IN (it uses buffer address and transfer count but does not await
data receipt) and RETURN instruction pair. Hardware then accepts words from the particular input bus and
stores them to the buffer. The buffer full condition will then generate the interrupt and restore the program
counter to the instruction after the RETURN. This approach to interrupt levels permits I/O programs to be
written that do not need to test what caused the interrupt. Programs read data, return, and then continue
directly into processing the data they read. Transfer overhead is minimized as most situations require little
or no register saving. Where an application uses synchronization on I/O, as in the PDE example, then
programs can be used to provide that capability.

Write operations can be started in several ways. For the PDE example, at the point where a result is to
be sent to a neighbor, the application level program executes a write call. The call provides buffer location,
word count and target address. The write subroutine includes the register loads and OUT instructions
needed to initiate the H/W and return to the application. H/W does the actual byte by byte data transfer.
More complicated output requirements will use an output service function at the highest interrupt level. Both
application and interrupt level tasks access that service via a soft interrupt.

Setting up circuit switched paths builds on these simple read and write operations. We start with all
PMEs having open buffers sized to accept packet headers but not the data. A PME needing to send data
initiates the transfer by sending an address/data block to a neighboring PME whose address better matches
the target. In the neighboring PME address information will be stored; due to buffer full an interrupt will
occur. The interrupt S/W tests the target address and will either extend the buffer to accept the data or write
the target address to an output port and set the CTL register for transparent data movement. (This allows
the PME to overlap its application executions with the circuit switched bridging operation.) The CTL register
goes to busy state and remains transparent until reset by the presence of a signal at end of data stream or
abnormally by PME programming. Any number of variations on these themes can be implemented.
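
The header handling just described can be illustrated with a small software model. This is a sketch under
simplifying assumptions and not the PME interrupt code: the PMEs sit on a single ring, "better matches the
target" is taken to mean the shorter way around that ring, and the class and function names are invented
for the example. The `transparent` flag stands in for the CTL register state.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    addr: int
    transparent: bool = False              # models the CTL register state
    buffer: list = field(default_factory=list)


def next_hop(here, target, ring_size):
    """Neighbor on the ring whose address better matches the target."""
    forward = (target - here) % ring_size
    step = 1 if forward <= ring_size - forward else -1
    return (here + step) % ring_size


def open_path(pmes, source, target):
    """Pass the header hop by hop; intermediate PMEs become transparent."""
    ring_size = len(pmes)
    path = [source]
    if source == target:
        return path
    here = next_hop(source, target, ring_size)
    while here != target:
        # Header interrupt at an intermediate PME: the target is elsewhere,
        # so forward the address and bridge instead of storing the data.
        pmes[here].transparent = True
        path.append(here)
        here = next_hop(here, target, ring_size)
    path.append(target)  # the target extends its buffer to accept the data
    return path


def send(pmes, source, target, data):
    """Open a path, stream the data, then reset the bridging PMEs."""
    path = open_path(pmes, source, target)
    pmes[target].buffer.extend(data)     # data flows through transparent PMEs
    for addr in path[1:-1]:
        pmes[addr].transparent = False   # end-of-stream signal resets CTL
    return path
```

For example, `send([PME(a) for a in range(8)], 1, 4, [10, 20])` routes through PMEs 2 and 3 and leaves the
two data words in the buffer of PME 4.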

System I/O and Array Director

FIGURE 12 shows an Array Director in the preferred embodiment, which may perform the functions of
the controller of FIGURE 13, which describes the system bus to array connections. FIGURE 13 is composed
of two parts, (a) the bus to/from a cluster, and part (b) the communication of information on the bus to/from
a PME. Loading or unloading the array is done by connecting the edges of clusters to the system bus.
Multiple system buses can be supported with multiple clusters. Each cluster supports 50 to 57 Mbyte/s
bandwidth. Loading or unloading the parallel array requires moving data between all or a subset of the
PMEs and standard buses (i.e. MicroChannel, VME-bus, FutureBus, etc.). Those buses, part of the host
processor or array controller, are assumed to be rigidly specified. The PME Array therefore must be
adapted to the buses. The PME Array can be matched to the bandwidth of any bus by interleaving bus data
onto n PMEs, with n picked to permit PMEs both I/O and processing time. FIGURE 13 shows how we might
connect the system buses to the PMEs at two edges of a cluster. Such an approach would permit 114
Mbyte/s to be supported. It also permits data to be loaded at half the peak rate to two edges
simultaneously. Although this reduces the bandwidth to 57 Mbyte/s per cluster, it has the advantage of
providing orthogonal data movement within the array and ability to pass data between two buses. (We use
those advantages to provide fast transpose and matrix multiply operation.)

As shown in part (a) of FIGURE 13, the bus "dots" to all paths on the edges of the cluster, and the
controller generates a gate signal to each path in the required interleave timing. If required to connect to a
system bus with greater than 57 Mbyte/s, the data will be interleaved over multiple clusters. For example, in
a system requiring 200 Mbyte/s system buses, groups of 2 or 4 clusters will be used. A large MPP has the
capacity to attach 16 or 64 such buses to its xy network paths. By using the w and z paths in addition to the
x and y paths, that number could be doubled.
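
One reading of the "groups of 2 or 4 clusters" example is that a 200 Mbyte/s bus needs four clusters when
each contributes one 57 Mbyte/s edge, and two clusters when each contributes two edges for 114 Mbyte/s.
The sketch below works through that interpretation; the function names and the round-robin word
assignment are assumptions made for illustration.

```python
import math

EDGE_MBYTE_S = 57  # burst bandwidth of one cluster edge


def clusters_per_bus(bus_mbyte_s, edges_per_cluster=1):
    """Smallest number of clusters whose edges together match the bus rate."""
    per_cluster = EDGE_MBYTE_S * edges_per_cluster
    return math.ceil(bus_mbyte_s / per_cluster)


def interleave(words, n_paths):
    """Assign successive bus words to paths in round-robin order."""
    return [words[i::n_paths] for i in range(n_paths)]


one_edge_group = clusters_per_bus(200, edges_per_cluster=1)
two_edge_group = clusters_per_bus(200, edges_per_cluster=2)
schedule = interleave(list(range(8)), 4)
```

Under this reading the controller's gate signals simply step through the paths in turn, so each path sees
every n-th word of the bus stream.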

FIGURE 13 part (b) shows how the data routes to individual PMEs. The FIGURE shows one particular
w, x, y or z path that can be operated at 7.13 Mbyte/s in burst mode. If the data on the system bus occurred
in bursts, and if the PME memory could contain the complete burst, then only one PME would be required.
We designed the PME I/O structure to require neither of these conditions. Data can be gated into
PMEx0 at the full rate until buffer full occurs. At that instant PMEx0 will change to transparent and PMEx1
will begin accepting the data. Within PMEx0 processing of the input data buffer can begin. PMEs that have
taken data and processed it are limited because they cannot transmit the results while in the transparent
mode. The design resolves this by switching the data stream to the opposite end of the path at intervals.
FIGURE 13(b) shows that under S/W control one might dedicate PMEx0 through PMEx3 to accepting data
while PMEx12 through PMEx15 unload results and vice-versa. The controller counts words and adds end of
block signals to the data stream, causing the switch in direction. One count applies to all paths supported
by the controller so controller workload is reasonable.
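
The buffer-full hand-off along a path can be modelled in a few lines. The following is a simplified sketch:
it ignores the direction switching and end-of-block signals, and the function name and buffer size are
hypothetical.

```python
def distribute(words, n_pmes, buffer_words):
    """Fill PMEx0, PMEx1, ... in turn from a continuous data stream.

    When a PME's buffer is full it turns transparent, so the next word on
    the path is accepted by the following PME.
    """
    buffers = [[] for _ in range(n_pmes)]
    current = 0
    for word in words:
        if current == n_pmes:
            raise ValueError("stream exceeds total buffer space on the path")
        buffers[current].append(word)
        if len(buffers[current]) == buffer_words:
            current += 1  # buffer full: this PME goes transparent
    return buffers


filled = distribute(list(range(10)), n_pmes=4, buffer_words=4)
```

Here the first two PMEs receive four words each, the third receives the remaining two, and the fourth is
still waiting. A PME that has turned transparent is free to begin processing its buffer while later words pass
through it.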

SYSTEMS FOR ALTERNATIVE COMPUTERS 

FIGURE 18 illustrates a system block diagram for a host attached large system with a single application
processor interface (API). This illustration may also be viewed with the understanding that our invention may
be employed in stand alone systems which use multiple application processor interfaces (not shown). This
configuration will support DASD/Graphics on all or many clusters. Workstation accelerators can eliminate the
host, application processor interface (API) and cluster synchronizer (CS) illustrated, by emulation. The CS
is not always required. It will depend on the type of processing that is being performed, as well as the
physical drive or power provided for a particular application which uses our invention. An application that is
doing mostly MIMD processing will not place as high a workload demand on the controller, so here the
control bus can see very slow pulse rise times. Conversely, a system doing mostly asynchronous A-SIMD
operations with many independent groupings may require faster control busing. In this case, a cluster
synchronizer will be desirable.

The system block diagram of FIGURE 18 illustrates that a system might consist of host, array controller
and PME array. The PME array is a set of clusters supported by a set of cluster controllers (CC). Although
a CC is shown for each cluster, that relationship is not strictly required. The actual ratio of clusters to CCs is
flexible. The CC provides redrive to, and accumulation from, the 64 BCIs/cluster. Therefore, physical
parameters can be used to establish the maximum ratio. Additionally, the CC will provide for controlling
multiple independent subsets of the PME array; that service might also become a gating requirement. A
study can be made to determine these requirements for any particular application of our invention. Two
versions of the CC will be used. A cluster that is to be connected to a system bus requires the CC providing
interleave controls (see System I/O and FIGURE 16) and tri-state drivers. A more simple version that omits
the tri-state busing features can also be employed. In the case of large systems, a second stage of redrive
and accumulation is added. This level is the cluster synchronizer (CS). The set of CCs plus CS and the
Application Processor Interface (API) make up the Array Controller. Only the API is a programmable unit.

Several variations of this system synthesis scheme will be used. These result in different hardware
configurations for various applications, but they do not have a major impact on the supporting software.

For a workstation accelerator, the cluster controllers will be attached directly to the workstation system
bus; the function of the API will be performed by the workstation. In the case of a RISC/6000, the system
bus is a Micro Channel and the CC units can plug directly into the slots within the workstation. This
configuration places the I/O devices (DASD, SCSI and display interfaces) on the same bus that
loads/unloads the array. As such the parallel array can be used for I/O intensive tasks like real time image
generation or processing. For workstations using other bus systems (VME-bus, FutureBus, etc.), a gateway
interface will be used. Such modules are readily available in the commercial marketplace. Note that in these
minimal scale systems a single CC can be shared between a determined number of clusters, and neither a
CS nor an API is needed.

A MIL avionics application might be similar in size to a workstation, but it needs different interfacing.
Consider what may become the normal military situation. An existing platform must be enhanced with
additional processing capability, but funding prohibits a complete processing system redesign. For this we
would attach to the APAP array a smart memory coprocessor. In this case, a special application program
interface API that appears to the host as memory will be provided. Data addressed to the host's memory
will then be moved to the array via CC(s). Subsequent writes to memory can be detected and interpreted as
commands by the API so that the accelerator appears to be a memory mapped coprocessor.

Large systems can be developed as either host attached or as stand alone configurations. For a host
attached system, the configuration shown in FIGURE 18 is useful. The host will be responsible for I/O, and
the API would serve as a dispatched task manager. However, a large stand alone system is also possible in
special situations. For example, a database search system might eliminate the host, attach DASD to the
MicroChannels of every cluster and use multiple APIs as bus masters slaved to the PMEs.

Zipper Array Interface with External I/O

Our zipper provides a fast I/O connection scheme and is accomplished by placing a switch between two
nodes of the array. This switch will allow for the parallel communication into and out of the array. The fast
I/O will be implemented along one edge of the array rings and acts like a large zipper into the X, Y, W, Z
rings. The name "zipper connection" is given to the fast I/O. Allowing data to be transferred into and out of
the network while only adding switch delays to transfer the data between processors is a unique loading
technique. The switching scheme does not disrupt the ring topology created by the X, Y, W, Z buses and
special support hardware allows the zipper operation to occur while the PE is processing or routing data.

The ability to bring data into and out of a massively parallel system rapidly is an important
enhancement to the performance of the overall system. We believe that the way we implement our fast I/O
without reducing the number of processors or dimension of the array network is unique in the field of
massively parallel environments.

The modified hypercube arrangement can be extended to permit a topology which comprises rings
within rings. To support the interface to the external I/O any or all of the rings can be logically broken. The
two ends of the broken ring can then be connected to external I/O buses. Breaking the rings is a logical
operation so as to permit regular inter-PME communication at certain time intervals while permitting I/O at
other time intervals. This process of breaking a level of rings within the modified hypercube effectively
'unzips' rings for I/O purposes. The fast I/O "zipper" provides a separate interface into the array. This zipper
may exist on 1 to n edges of the modified hypercube and could support either parallel input into multiple
dimensions of the array or broadcast to multiple dimensions of the array. Further data transfers into or out
of the array could alternate between the two nodes directly attached to the zipper. This I/O approach is
unique and it permits developing different zipper sizes to satisfy particular application requirements. For
example, in the particular configuration shown in FIGURE 6, called the large fine-grained processor 365, the
zipper for the Z and W buses will be dotted onto the MCA bus. This approach optimizes the matrix
transposition time, satisfying a particular application requirement for the processor. For a more detailed
explanation of the "zipper" structure, reference may be had to the APAP I/O ZIPPER application filed
concurrently herewith. The zipper is here illustrated by FIGURE 14.

Depending on the configuration and the need of the program to roll data and program into and out of
the individual processing elements, the size of the zipper can be varied. The actual speed of the I/O zipper
is approximately the number of rings attached times the PME bus width, times the PME clock rate, all
divided by 2. (The division permits the receiving PME time to move data onward. Since it can send it to any
of n places, I/O contention is completely absorbed over the Array.) With existing technology, i.e., 5 MB/sec
PME transfer rate, 64 rings on the zipper, and interleaved to two nodes transfers, 320 MB/sec Array transfer
rates are possible (see the typical zipper configuration in FIGURE 14). FIGURE 14 illustrates the fast I/O or
the so-called "zipper connection" 700, 710 which exists as a separate interface into the array. This zipper
may exist on one 700 or two edges 700, 710 of the hypercube network by dotting onto the broadcast bus
720, 730, 740, 750, at multiple nodes in the array 751, 752, 753, 754 and in multiple directions 770, 780,
790> 7$l. 7SS« ?67. 
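The rate estimate above can be checked with a short calculation. This is only a sketch: it assumes the ring count is 64 (the value that reproduces the quoted 320 MB/sec), that the per-PME transfer rate already combines bus width and clock rate, and that interleaving across two nodes scales the rate linearly. The function and parameter names are illustrative and do not come from the specification.

```python
def zipper_rate_mb_per_s(rings, pme_rate_mb_per_s, interleave=1):
    """Approximate zipper throughput in MB/sec.

    rings: number of rings attached to the zipper
    pme_rate_mb_per_s: per-PME transfer rate (bus width times clock rate)
    interleave: number of nodes the zipper alternates between
                (assumption: the rate scales linearly with this factor)
    """
    # Divide by 2 so that the receiving PME has time to move data onward.
    return rings * pme_rate_mb_per_s / 2 * interleave


print(zipper_rate_mb_per_s(64, 5.0))     # simple, non-interleaved mode
print(zipper_rate_mb_per_s(64, 5.0, 2))  # interleaved to two nodes
```

Under these assumptions the non-interleaved figure falls inside the burst range quoted for a single MCA bus in the next paragraph.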

Today's MCA bus supports 60 to 160 MB per second burst transfer rate and therefore is a good match
for a single zipper in simple or non-interleaved mode. The actual transfer rate given channel overhead and
efficiency is something less than that. For systems that have even more demanding I/O requirements,
multiple zippers and MCA buses can be utilized. These techniques are seen to be important to processors
that would support a large external storage associated with nodes or clusters, as might be characteristic of
database machines. Such I/O growth capability is completely unique to this machine and has not previously
been incorporated in either massively parallel, conventional single processor, or coarse-grained parallel
machines.

Array Director Architecture 

Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, clusters of
nodes, and arrays of PMEs already packaged in clusters. For control of these packaged systems we
provide a system array director which with the hardware controllers performs the overall Processing
Memory Element (PME) Array Controller functions in the massively parallel processing environment. The
Director comprises three functional areas, the Application Interface, the Cluster Synchronizer, and
normally a Cluster Controller. The Array Director will have the overall control of the PME array, using the
broadcast bus and our zipper connection to steer data and commands to all of the PMEs. The Array
Director functions as a software system interacting with the hardware to perform the role of the shell of the
operating system. The Array Director in performing this role receives commands from the application
interface and issues the appropriate array instructions and hardware sequences to accomplish the
designated task. The Array Director's main function is to continuously feed the instructions to the PMEs and
route data in optimal sequences to keep the traffic at a maximum and collisions to a minimum.

The APAP computer system shown in FIGURE 6 is illustrated in more detail in the diagram of FIGURE
12 which illustrates the Array Director which can function as a controller, or array controller, as illustrated in
FIGURE 13 and FIGURES 18 and 19. This Array Director 610 illustrated in FIGURE 12 is shown in the
preferred embodiment of an APAP in a typical configuration of n identical array clusters 665, 670, 680, 690,
with an array director 610 for the clusters of 512 PMEs, and an application processor interface 630 for the
application processor or processors 600. The synchronizer 650 provides the needed sequences to the array
or cluster controller 640 and together they make up the "Array Director" 610. The application processor
interface 630 will provide the support for the host processor 600 or processors and test/debug workstations.
For APAP units attached to one or more hosts, the Array Director serves as the interface between the user
and the array of PMEs. For APAPs functioning as stand alone parallel processing machines, the Array
Director becomes the host unit and accordingly becomes involved in unit I/O activities.

The Array Director will consist of the following four functional areas: (see the functional block diagram in
FIGURE 12)

1. Application Processor Interface (API) 630,
2. Cluster Synchronizer (CS) 650 (8 x 8 array of clusters),
3. Cluster Controller (CC) 640 (8 x 1 array of nodes),
4. Fast I/O (zipper connection) 620.

The Application Processor Interface (API) 630:

When operating in attached modes, one API will be used for each host. That API will monitor the
incoming data stream to determine what are instructions to the Array clusters 635, 670, 680, 690 and what
are data for the Fast I/O (zipper) 620. When in standalone mode, the API serves as the primary user
program host.

To support these various requirements, the APIs contain the only processors within the Array Director,
plus the dedicated storage for the API program and commands. Instructions received from the host can call
for execution of API subroutines, loading of API memory with additional functions, or for loading of CC and
PME memory with new S/W. As described in the S/W overview section, these various types of requests can be
restricted to a subset of users via the initial programs loaded into the API. Thus, the operating program
loaded will determine the type of support provided, which can be tailored to match the performance
capability of the API. This further permits the APAP to be adjusted to the needs of multiple users requiring
managed and well tested services, or to the individual user wishing to obtain peak performance on a
particular application.

The API also provides for managing the path to and from the I/O zipper. Data received from the host
system in attached modes, or from devices in standalone modes, is forwarded to the Array. Prior to initiating
this type of operation the PMEs within the Array which will be managing the I/O are initiated. PMEs
operating in MIMD mode can utilize the fast interrupt capability and either standard S/W or special functions
for this transfer while those operating in SIMD modes would have to be provided detailed control
instructions. Data being sent from the I/O zipper requires somewhat the opposite conditioning. PMEs
operating in MIMD modes must signal the API via the high speed serial interface and await a response from
the API, while PMEs in SIMD modes are already in synchronization with the API and can therefore
immediately output data. The ability to system switch between modes provides a unique ability to adjust the
program to the application.

Cluster Synchronizer (CS) 650

The CS 650 provides the bridge between the API 630 and CC 640. It stores API 630 output in FIFO
stacks and monitors the status being returned from the CC 640 (both parallel input acknowledges and high
speed serial bus data) to provide the CC, in timely fashion, with the desired routines or operations that need
to be started. The CS provides the capability to support different CCs and different PMEs within clusters so
as to permit dividing the array into subsets. This is done by partitioning the array and then commanding the
involved cluster controllers to selectively forward the desired operation. The primary function of the
synchronizer is to keep all clusters operating and organized such that overhead time is minimized or buried
under the covers of PME execution time. We have described how the use of the cluster synchronizer in
A-SIMD configurations is especially desirable.

Cluster Controller (CC) 640

The CC 640 interfaces to the node Broadcast and Control Interface (BCI) 605 for the set of nodes in an
array cluster 635. (For a 4d modified hypercube with 8 nodes per ring that means the CC 640 is attached to
64 BCIs 605 in an 8 by 8 array of nodes and is controlling 512 PMEs. Sixty-four such clusters, also in an 8
by 8 array, lead to the full up system with 32768 PMEs.) The CC 640 will send commands and data
supplied by the CS 650 to the BCI parallel port and return the acknowledgement data to the CS 650 when
operating in MIMD modes. In SIMD mode the interface operates synchronously, and step by step
acknowledgments are not required. The CC 640 also manages and monitors the high speed serial port to
determine when PMEs within the nodes are requesting services. Such requests are passed upward to the
CS 650 while the raw data from the high speed serial interface is made available to the status display
interface. The CC 640 provides the CS 650 with an interface to specific nodes within the cluster via the
standard speed serial interface.
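The sizing figures in the parenthetical above follow from simple multiplication. The sketch below reproduces them, taking eight PMEs per node from the chip description elsewhere in this specification; the constant and function names are illustrative only.

```python
PMES_PER_NODE = 8            # eight PMEs on each chip (node)
NODES_PER_CLUSTER = 8 * 8    # 8 by 8 array of nodes, one BCI per node
CLUSTERS_PER_SYSTEM = 8 * 8  # 8 by 8 array of clusters


def system_size():
    """Return the BCI and PME counts implied by the figures above."""
    pmes_per_cluster = NODES_PER_CLUSTER * PMES_PER_NODE
    return {
        "bcis_per_cluster": NODES_PER_CLUSTER,
        "pmes_per_cluster": pmes_per_cluster,
        "pmes_per_system": pmes_per_cluster * CLUSTERS_PER_SYSTEM,
    }


print(system_size())
```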

In SIMD mode the CC will be directed to send instructions or addresses to all the PMEs over the
broadcast bus. The CC can dispatch a 16 bit instruction to all PMEs every 40 nanoseconds when in SIMD
mode. By broadcasting groups of native instructions to the PME, the emulated instruction set is formed.

When in MIMD mode the CC will wait for the endop signal before issuing new instructions to the PMEs.
The concept of the MIMD mode is to build strings of micro-routines with native instructions resident in the
PME. These strings can be grouped together to form the emulated instructions, and these emulated
instructions can be combined to produce service/canned routines or library functions.

When in SIMD/MIMD (SIMIMD) mode, the CC will issue instructions as if in SIMD mode and check for
endop signals from certain PMEs. The PMEs that are in MIMD will not respond to the broadcast instructions
and will continue with their designated operation. The unique status indicators will help the CC to manage
this operation and determine when and to whom to present the sequenced instructions.
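The mixed-mode behaviour described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the controller's actual logic: a PME in MIMD mode is modelled as having a fixed number of local steps remaining, it ignores broadcasts while busy, and it raises endop on the cycle in which it finishes. The class, field, and function names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    ident: int
    mimd_steps_left: int = 0  # > 0 means the PME is busy in MIMD mode
    executed: list = field(default_factory=list)

    def tick(self, broadcast):
        """Advance one controller cycle; return True when endop is raised."""
        if self.mimd_steps_left > 0:
            self.mimd_steps_left -= 1         # broadcast is ignored
            return self.mimd_steps_left == 0  # endop on completion
        self.executed.append(broadcast)       # SIMD: decode the broadcast
        return False


def run_simimd(pmes, stream):
    """Broadcast each instruction and collect (cycle, PME id) endop events."""
    endops = []
    for cycle, instr in enumerate(stream):
        for pme in pmes:
            if pme.tick(instr):
                endops.append((cycle, pme.ident))
    return endops


pmes = [PME(0), PME(1, mimd_steps_left=2)]
print(run_simimd(pmes, ["A", "B", "C"]))
print(pmes[0].executed, pmes[1].executed)
```

In this model PME 1 misses the first two broadcasts while finishing its MIMD routine and rejoins the SIMD stream afterwards, which is the situation the status indicators are said to help the CC manage.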

Operational Software levels

This application overviews the operational software (S/W) levels to provide further explanation of the
services performed by various hardware (H/W) components.

Computer systems generally used have an operating system. Operating system kernels which are
relatively complete must be provided in most massive MIMD machines, where workstation class CPU chips
run kernels such as Mach. The operating system kernel supports message passing or memory coherency.
Other massively parallel systems based upon SIMD models have almost no intelligence in the array. There
are no "program counters" out in the array, and thus no programs to execute locally. All instructions are
broadcast.

In the systems we have provided with our PME as the basis for cluster arrays, there is no need for an
operating system at each chip, a node. We provide a library of key functions for computation and/or
communication within each PE (PME) that can be invoked at a high level. SIMD-like instructions are
broadcast to the array to set each of a selected set of PMEs. These PMEs can then perform in full MIMD
mode one or more of these library routines. In addition, basic interrupt handler and communications routines
are resident in each PME, allowing the PME to handle communication on a dynamic basis. Unlike existing
MIMD machines, the APAP structure need not include an entire program in PME memory. Instead all of that
code which is essentially serial is in the cluster controller. Thus such code, 90% by space and 10% by time
(typically), can be broadcast in a SIMD fashion to an array of PMEs. Only the truly parallel inner loops are
distributed to the PMEs in a dynamic fashion. These are then initiated into MIMD mode just as other
"library" routines are. This enables use of Single Program Multiple Data program models, where
the same program is loaded in each PME node, with embedded synchronization code, and
executed at the local PME. Design parameters affect bandwidth available on different links, and the system
paths are programmatically configurable, allowing high bandwidth links on a target network, and allowing
dynamic partition of off chip links like PME-to-PME links to provide more bandwidth on specific paths as meets
the needs of a particular application. The links leaving a chip mate directly with each other, without the need
for external logic. There are sufficient links and there is no predesigned constraint as to which other links
they can attach to, so that the system can have a diversity of interconnection topologies, with routing
performed dynamically and programmatically.

The system allows usage of existing compilers and parsers to create an executable parallel program
which could run on a host or workstation based configuration. Sequential source code for a Single Program
Multiple Data system would pass through program analysis, for examination of dependency, data and
controls, enabling extension of program source to include call graphs, dependency tables, aliases, usage
tables and the like.

Thereafter, program transformation would occur whereby a modified version of the program would be
created that extends the degrees of parallelism by combining sequences or recognizing patterns to
generate explicit compiler directives. A next step would be a data allocation and partitioning step, with
message generation, which would analyze data usage patterns and allocate so that elements to be combined
would share common indexing and addressing patterns, and these would provide embedded program compiler
directives and calls to communication services. At this point the program would pass to a level partitioning
step. A level partitioning step would separate the program into portions for execution in ARRAY, in ARRAY
CONTROLLER (array director or cluster controller), and HOST. Array portions would be interleaved in
sections with any required message passing synchronization functions. At this point level processing could
proceed. Host sources would pass to a level compiler (FORTRAN) for assembly compilation. Controller
sources would pass to a microprocessor controller compiler, and items that would be needed by a single
PME and not available in a library call would pass to a parser (FORTRAN or C) to an intermediate level
language representation which would generate optimized PME code and Array Controller code. PME code
would be created at PME machine level and would include library extensions, which would pass on load
into a PME memory. During execution a PME parallel program in the SPMD process of execution could call
upon already coded assembly service functions from a runtime library kernel.

Since the APAP can function as either an attached unit that is closely or loosely coupled with its host or
as a stand alone processor, some variation in the upper level S/W models exists. However, these variations
serve to integrate the various types of applications so as to permit a single set of lower level functions to satisfy
all three applications. The explanation will address the attached version S/W first and then the modifications
required for standalone modes.

In any system, as illustrated by FIGURE 10, where the APAP is intended to attach to a host processor
the user's primary program would exist within the host and would delegate to the APAP unit tasks and
associated data as needed to provide desired load balancing. The choice of interpreting the dispatched
task's program within the host or the Array Director is a user option. Host level interpretation permits the
Array Director to work at interleaving users which do not exploit close control of the Array, while APAP
interpretation leads to minimal latency in control branching but tends to limit the APAP time to perform
multi-user management tasks. This leads to the concept that the APAP and host can be tightly or loosely
coupled.

Two examples illustrate the extremes:

1. When APAP is attached to 3090 class machines with Floating Point Vector Facilities, user data in
compressed form could be stored within the APAP. A host program that called for a vector operation
upon two vectors with differing sparseness characteristics would then send instructions to the APAP to
realign the data into element by element matching pairs, output the result to the Vector Facility, read the
answer from the Vector Facility and finally reconfigure data into final sparse data form. Segments of the
APAP would be interpreting and building sparse matrix bit maps, while other sections would be
calculating how to move data between PMEs such that it would be properly aligned for the zipper.

2. With APAP attached to a small inflight military computer, the APAP could be performing the entire
workload associated with Sensor Fusion Processing. The host might initiate the process once, send
sensor data as it was received to the APAP and then wait for results. The Array Director would then have
to schedule and sequence the PME array through perhaps dozens of processing steps required to
perform the process.

The APAP will support three levels of user control:

1. Casual User. S/he works with supplied routines and library functions. These routines are maintained at
the host or API level and can be invoked by the user via subroutine calls within his program.

2. Customizer User. S/he can write special functions which operate within the API and which directly
invoke routines supplied with the API or services supplied with the CC or PME.

3. Development User. S/he generates programs for execution in the CC or PME, depending upon API
services for program load and status feedback.

Satisfying these three user levels in either closely or loosely coupled systems leads to the partitioning
of H/W control tasks.

API Software Tasks

The application program interface API contains S/W services that can test the leading words of data
received and can determine whether that data should be interpreted by the API, loaded to some storage
within the Array Director or PME, or passed to the I/O zipper.
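The three-way decision on the leading words can be sketched as a small dispatcher. The specification does not give the encoding, so the tag values, the choice of the top four bits of a 16 bit leading word, and all names below are assumptions made purely for illustration.

```python
from enum import Enum


class Route(Enum):
    INTERPRET = "interpret in the API"
    LOAD = "load to Array Director or PME storage"
    ZIPPER = "pass to the I/O zipper"


# Hypothetical tag values; the source does not define the real encoding.
TAG_TO_ROUTE = {0x1: Route.INTERPRET, 0x2: Route.LOAD, 0x3: Route.ZIPPER}


def classify(words):
    """Test the leading word and return (route, remaining words)."""
    if not words:
        raise ValueError("empty transfer")
    tag = words[0] >> 12  # assumption: top 4 bits of a 16 bit word
    if tag not in TAG_TO_ROUTE:
        raise ValueError(f"unknown tag {tag:#x}")
    return TAG_TO_ROUTE[tag], words[1:]


route, payload = classify([0x3000, 7, 8])
print(route.value, payload)
```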

For data that is to be interpreted, the API determines the required operation and invokes the function.
The most common type of operation would call for the Array to perform some function which would be
executed as a result of API writes to the CS (and indirectly to the CC). The actual data written to the CS/CC
would in general be constructed by the API operational routine based upon the parameters passed to the
API from the host. Data sent to the CS/CC would in turn be forwarded to the PMEs via the node BCI.

Data could be loaded to either API storage, CC storage, or PME memory. Further, data to be loaded to
PME memory could be loaded via either the I/O zipper or via the node BCI. For data to be put into the API
memory, the incoming bus would be read then written to storage. Data targeted to the CC memory would
be similarly read and then be written to the CC memory. Finally, data for the PME memory (in this case
normally new or additional MIMD programs) could be sent to all or selected PMEs via the CS/CC/Node BCI
or to a subset of PMEs for selective redistribution via the I/O zipper.

When data is to be sent to the I/O zipper, it could be preceded by inline commands that permit the
PME MIMD programs to determine its ultimate target, or it could be preceded by calls to the API service
functions to perform either MIMD initiation or SIMD transmission.

In addition to responding to requests for service received via the host interface, the API program will
respond to requests from the PMEs. Such requests will be generated on the high speed serial port and will
be routed through the CC/CS combination. Requests of this sort can result in the API program's directly
servicing the PMEs or accessing the PMEs via the standard speed serial port to determine further qualifying
data relative to the service request.

PME Software

The software plan includes: 

o Generation of PME resident service routines (that is, 'an extended ISA') for complex operations and
I/O management.
o Definition and development of controller executed subroutines that produce and pass control and
parameter data to the PMEs via the BCI bus. These subroutines:
1. cause a set of PMEs to do mathematical operations on distributed objects,
2. provide I/O data management and synchronization services for PME Array and System Bus
interactions,
3. provide startup program load, program overlay and program task management for PMEs.
o Development of data allocation support services for host level programs, and
o Development of a programming support system including assembler, simulator, and H/W monitor and
debug workstation.

Based upon studies of military sensor fusion, optimization, image transformation, US Post Office optical
character recognition and FBI fingerprint matching applications, we have concluded that a parallel processor
programmed with vector and array commands (like BLAS calls) would be effective. The underlying
programming model must match the PME array characteristics feasible with today's technology.
Specifically:

o PMEs can be independent stored program processors,
o The array can have thousands of PMEs, and be suitable for fine grained parallelism,
o Inter-PME networks will have very high aggregate bandwidth and a small 'logical diameter',
o But, by network connected microprocessor MIMD standards, each PME is memory limited.

Prior programming on MIMD parallel processors has used task dispatching methodology. Such
approaches lead to each PME needing access to a portion of a large program. This characteristic, in
combination with the non-shared memory characteristic of the H/W, would exhaust PME memory on any
significant problem. We therefore target what we believe is a new programming model, called
'asynchronous SIMD' (A-SIMD) type processing. In this connection see USSN 70^788, filed November 27,
1991 of P. Kogge, which is incorporated herein.

A-SIMD programming in our APAP design means that a group of PMEs will be directed by commands
broadcast to them as in SIMD models. The broadcast command will initiate execution of a MIMD function
within each PME. That execution can involve data dependent branching and addressing within PMEs, and
I/O based synchronization with either other PMEs or the BCI.

Normally, PMEs will complete the processing and synchronize by reading the next command from the
BCI.
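A toy model may help make the A-SIMD idea concrete. In the sketch below each broadcast command names a routine from a PME-resident library; every PME runs that routine on its own local data, possibly taking a different number of steps, and the controller moves to the next command only once all PMEs have finished. The library contents, the memory layout, and every name here are hypothetical, and the synchronous Python loop stands in for what would be concurrent execution on real hardware.

```python
def add(mem, k):
    """Fixed-length routine: one step on every PME."""
    mem["x"] += k
    return 1


def normalize(mem, _):
    """Data dependent routine: step count varies with local data."""
    steps = 0
    while mem["x"] > 1:  # data dependent branching inside the PME
        mem["x"] //= 2
        steps += 1
    return steps


LIBRARY = {"add": add, "normalize": normalize}  # PME-resident routines


def a_simd_run(memories, commands):
    """Broadcast each command; record the local step count of every PME."""
    trace = []
    for name, arg in commands:
        # Each PME executes the named routine on its own memory; reaching the
        # next loop iteration models all PMEs reading the next command.
        steps = [LIBRARY[name](mem, arg) for mem in memories]
        trace.append((name, steps))
    return trace


memories = [{"x": 1}, {"x": 6}]
print(a_simd_run(memories, [("add", 2), ("normalize", None)]))
```

The second command shows the asynchronous part: the two PMEs receive the same broadcast but run for different numbers of local steps before synchronizing again.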

The A-SIMD approach includes both MIMD and SIMD operating modes. Since the approach imposes no
actual time limits on the command execution period, a PME operation that synchronizes on data transfers
and executes indefinitely can be initiated. Such functions are very effective in data filtering, DSP, and
systolic operations. (They can be ended by either BCI interrupts or by commands over the serial control
buses.) SIMD operation results from any A-SIMD control stream that does not include MIMD Mode
Commands. Such a control stream can include any of the PME's native instructions. These instructions are
routed directly to the instruction decode logic of the PME. Eliminating the PME instruction fetch provides a
higher performance mode for tasks that do not involve data dependent branching.

This programming model (supported by H/W features) extends to permitting the array of PMEs to be
divided into independent sections. A separate A-SIMD command stream controls each section. Our
application studies show that programs of interest divide into separate phases (i.e., input, input buffering,
several processing steps, and output formatting, etc.), suitable for pipeline data processing. Fine-grained
parallelism results from applying the n PMEs in a section to a program phase. Applying coarse-grained
partitioning to applications often results in discovering small repetitive tasks suitable for MIMD or memory
bandwidth limited tasks suitable for SIMD processing. We program the MIMD portions using conventional
techniques and program the remaining phases as A-SIMD sections, coded with vectorized commands,
sequenced by the array controller. This makes the large controller memory the program store. Varying the
number of PMEs per section permits balancing the workload. Varying the dispatched task size permits
balancing the BCI bus bandwidth to the control requirement.

The programming model also considers allocating data elements to PMEs. The approach is to distribute
data elements evenly over PMEs. In early versions of S/W, this will be done by the programmer or by S/W.
We recognize that the IBM parallelizing compiler technologies apply to this problem and we expect to
investigate their usage. However, the inter-PME bandwidth provided does tend to reduce the importance of
this approach. This links data allocation and I/O mechanism performance.

The H/W requires that the PME initiate data transfers out of its memory, and it supports a controlled
write into PME memory without PME program involvement. Input control occurs in the receiving PME by
providing an input buffer address and a maximum length. When I/O to a PME results in buffer overflow,
H/W will interrupt the receiving PME. The low level I/O functions that will be developed for PMEs build on
this service. We will support either movement of raw data between adjacent PMEs or movement of
addressed data between any PMEs. The last capability depends upon the circuit switched and store and
forward mechanisms. The interpret address and forward operation is important for performance. We have
optimized the H/W and S/W to support the operation. Using one word buffers results in an interrupt upon
receipt of the address header. Comparing target id with local id permits output path selection. Transfer of the
subsequent data words occurs in circuit switched mode. A slight variation on this process using larger
buffers results in a store and forward mechanism.
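The interpret-address-and-forward operation can be sketched as follows. The model is deliberately simplified and rests on assumptions not found in the text: PMEs sit on a single one-directional ring, ids are small integers, and the only path choice is between local delivery and the next PME on the ring. The function names are illustrative.

```python
def select_path(local_id, target_id):
    """Header interrupt handler: compare target id with local id."""
    return "deliver" if target_id == local_id else "forward"


def route(ring_size, source, target, payload):
    """Walk a header around the ring; return (PMEs that forwarded, payload)."""
    if not (0 <= source < ring_size and 0 <= target < ring_size):
        raise ValueError("PME id out of range")
    forwarded_by = []
    here = source
    while select_path(here, target) == "forward":
        forwarded_by.append(here)      # path is set up through this PME
        here = (here + 1) % ring_size  # assumption: one-directional ring
    # Once the header reaches the target, the data words follow the path
    # already selected (circuit switched), so the payload arrives unchanged.
    return forwarded_by, payload


print(route(8, 6, 1, [10, 20, 30]))
```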

Because of the high performance inter-PME bandwidth, it is not always necessary or desirable to place
data elements within the PME Array carefully. Consider shifting a vector data element distributed across
PMEs. Our architecture can send data without an address header, thus providing for very fast I/O. However,
we have found, in many applications, that optimizing a data structure for movement in one direction
penalizes data movement in an orthogonal direction. The penalty in such situations approximates the
average cost of randomly routing data in the network. This leads to applications where placing data
sequentially or randomly (as opposed to arranging data) results in shorter average process times.

Many applications can be synchronized to take advantage of average access time. (For example, PDE
relaxation processes acquire data from a neighborhood and thus can average access over at least four I/O
operations.) We believe that after considering the factors applicable to vector and array processes, like
scatter/gather or row/column arithmetic, many users will find brute force data allocation to be suitable for the
application. However, we know of examples that illustrate application characteristics (like required
synchronization or biased utilization of shift directions¹) that tend to force particular data allocation patterns.
This characteristic requires that the tools and techniques developed support either manual tuning of the data
placement, or simple and non-optimum data allocation. (We will support the non-optimum data allocation
strategy with host level macros to provide near transparent port of vectorized host programs to the MPP.
The H/W Monitor workstation will permit the user to investigate the resultant performance.)

FIGURE 19 shows the general S/W development and usage environment. The Host Application
Processor is optional in that program execution can be controlled from either the Host or the Monitor.
Further, the Monitor will effectively replace the Array Controller in some situations. The environment will
support program execution on real or simulated MPP hardware. The Monitor is scenario driven so that the
developer doing test and debug operations can create procedures to permit effective operation at any level
of abstraction.

FIGURE 20 illustrates the levels of H/W supported within the MPP and the user interfaces to these
levels.

We see two potential application programming techniques for the MPP. In the least programmer
intensive approach, applications would be written in a vectorized high order language. If the user did not
feel that the problem warranted tuning data placement then he would use compile time services to allocate
data to the PME Array. The application would use vector calls like BLAS that would be passed to the
controller for interpretation and execution on the PME Array. Unique calls would be used to move data
between host and PME Array. In summary, the user would not need to be aware of how the MPP organized
or processed the data. Two optimization techniques will be supported for this type of application:

1. Altering the data allocation by constructing the data allocation table will permit programs to force data
placements.

2. Generation of additional vector commands for execution by the array controller will permit tuned
subfunctions (i.e., calling the Gaussian Elimination as a single operation).

We also see that the processor can be applied to specialized applications as in those referenced in the
beginning of this section. In such cases, code tuned to the application would be used. However, even in
such applications the degree of tuning will depend upon how important a particular task is to the application.
It is in this situation that we see the need for tasks individually suited to SIMD, MIMD or A-SIMD modes.
These programs will use a combination of:

1. Sequences of PME native instructions passed to an emulator function within the array controller. The
emulator will broadcast the instruction and its parameters to the PME set. The PMEs in this SIMD mode
will pass the instruction to the decode function, simulating a memory fetch operation.

2. Tight inner loops that can be I/O synchronized will use PME native ISA programs. After initiation from
a SIMD mode change, they would run continuously in MIMD mode. (The option to return to SIMD mode
via a 'RETURN' instruction exists.)

3. More complicated programs, as would be written in a vectorizing command set, would execute
subroutines in the array controller that invoked PME native functions. For example, a simplified array
controller program to do a BLAS 'SAXPY' command on vectors loaded sequentially across PMEs would
start sequences within the PMEs that:
a. Enable PMEs with required x elements via comparison of PME id with broadcast 'incx' and
x address values,
b. Compress the x values via a write to consecutive PMEs,
c. Calculate the address of PMEs with y elements from broadcast data,
d. Transmit the compressed x data to the y PMEs,
e. Do a single precision floating point operation in PMEs receiving x values to complete the operation.

Finally, the SAXPY example illustrates one additional aspect of executing vectorized application
programs. The major steps execute in the API and could be programmed by either an optimizer or product
developer. Normally, the vectorized application would call rather than include this level of code. These steps
would be written as C or Fortran code and will use memory mapped reads or writes to control the PME array
via the BCI bus. Such a program operates the PME array as a series of MIMD steps synchronized by
returns to the API program. Minor steps such as the single precision floating point routines would be
developed by the Customizer or Product Developer. These operations will be coded using the native PME
ISA and will be tuned to the machine characteristics. In general, this would be the domain of the Product
Developer since coding, test and optimization at this level require usage of the complete product
development tool set.

¹ Gaussian Elimination with normal pivoting requires shifting rows but not columns. More than fct performance
difference would result from arranging the data such that columns were on the fast shift direction. Even with
that there is not an advantage to arranging rows in any particular relationship to the buses.

The APAP can have applications written in sequential Fortran. The path is quite different. FIGURE 21
outlines a Fortran compiler which can be used. In the first step, it uses a portion of the existing parallelizing
compiler to develop program dependencies. The source plus these tables become an input to a process
that uses a characterization of the APAP MMP and the source to enhance parallelism.

This MMP is a non-shared memory machine and as such allocates data between the PMEs for local
and global memory. The very fast data transfer times and the high network bandwidth reduce the time
effect of data allocation, but it still is addressed. Our approach treats part of memory as global and uses a
S/W service function. It is also possible to use the dependency information to perform the data allocation in
a second alternative. The final step in converting the source to multiple sequential programs is performed
by the Level Partitioning step. This partitioning step is analogous to the Fortran 3 work being
conducted with DARPA funding. The last process in the compilation is generation of the executable code at
all individual functional levels. For the PME this will be done by programming the code generator on an
existing compiler system. The Host and API code compilers generate the code targeted to those machines.

The PME can execute MIMD software from its own memory. In general, the multiple PMEs would not
be executing totally different programs but rather would be executing the same small program in an
asynchronous manner. Three basic types of S/W can be considered, although the design approach does not
limit the APAP to just these approaches:
1. Specialized emulation functions would make the PME Array emulate the set of services provided by
standard user libraries like LINPACK or VPSS. In such an emulation package, the PME Array could be
using its multiple set of devices to perform one of the operations required in a normal vector call. This
type of emulation, when attached to a vector processing unit, could utilize the vector unit for some
operations while performing others internally.

2. The parallelism of the PME Array could be exploited by operating a set of software that provides a
new set of mathematical and service functions in the PMEs. This set of primitives would be the codes
exploited by a customizing user to construct his application. The prior example of performing sensor
fusion on an APAP attached to a military platform would use such an approach. The customizer would
write routines to perform Kalman Filters, Track Optimum Assignment and Threat Assessment using the
supplied set of function names. This application would be a series of API call statements, and each call
would result in initiating the PME set to perform some basic operation like 'matrix multiply' on data
stored within the PME Array.

3. In cases where no effective method exists, considering performance objectives or application needs,
custom S/W could be developed and executed within the PME. A specific example is 'Sort'. Many
methods to sort data exist and the objective in all cases is to tune the process and the program to the
machine architecture. The modified hypercube is well suited to a Batcher Sort; however, that sort
requires extensive calculations to determine particular elements to compare versus very short comparison
cycles. The computer program in FIGURE 17 shows a simple example of a PME program 1100 to
perform the Batcher Sort 1000 with one element per PME. Each line of the program description would be
expanded to 3 to 6 PME machine level instructions, and all PMEs would then execute the program in
MIMD mode. Program synchronization is managed via the I/O statements. The program extends to more
data elements per PME and to very large parallel sorts in a quite straightforward manner.
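
To make the 'one element per PME' sort concrete, the sketch below simulates Batcher's bitonic sorting network on a power-of-two number of PMEs. It is not the program of FIGURE 17, which is not reproduced in this text. It assumes the bitonic variant of Batcher's sort, and the function name and lock-step simulation are illustrative only. Each compare-exchange pairs a PME with the partner whose id differs in exactly one bit, which is the neighbor relation a hypercube provides.

```python
from typing import List


def bitonic_sort_on_pmes(values: List[int]) -> List[int]:
    """Simulate Batcher's bitonic sort with one element per PME.

    PME i exchanges with PME i ^ j (a single-bit hypercube neighbor) at each
    step and keeps either the smaller or the larger of the two values.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of PMEs must be a power of two")
    data = list(values)
    k = 2                                   # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2                          # distance to the exchange partner
        while j >= 1:
            nxt = list(data)                # all PMEs act on the same snapshot
            for pme in range(n):
                partner = pme ^ j
                ascending = (pme & k) == 0  # direction of this PME's block
                keep_min = (pme < partner) == ascending
                a, b = data[pme], data[partner]
                nxt[pme] = min(a, b) if keep_min else max(a, b)
            data = nxt
            j //= 2
        k *= 2
    return data
```

Most of the work per step is computing `partner` and `keep_min` rather than the comparison itself, which matches the remark above that the sort needs extensive calculation to select elements relative to very short comparison cycles.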
CC Storage Contents

Data from the CC storage is used by the PME Array in one of two manners. When the PMEs are
operating in SIMD, a series of instructions can be fetched by the CC and passed to the node BCI, thus
reducing load on both the API and CS. Alternatively, functions that are not frequently required, such as PME
Fault Reconfiguration S/W, PME Diagnostics, and perhaps conversion routines can be stored in the CC
memory. Such functions can then be requested by operating PME MIMD programs or moved to the PMEs
at the request of API program directives.

Packaging of the 8-Way Modified Hypercube

Our packaging techniques take advantage of the eight PMEs packaged in a single chip and arranged in
an N-dimensional modified hypercube configuration. This chip level package or node of the array is the
smallest building block in the APAP design. These nodes are then packaged in an 8 X 8 array where the
+-X and the +-Y make rings within the array or cluster, and the +-W and +-Z are brought out to the
neighboring clusters. A grouping of clusters makes up an array. This step significantly cuts down wire count
for data and control for the array. The W and Z buses will connect to the adjacent clusters and form W and
Z rings to provide total connectivity around the completed array of various sizes. The massively parallel
system will be comprised of these cluster building blocks to form the massive array of PMEs. The APAP
will consist of an 8 X 8 array of clusters; each cluster will have its own controller and all the controllers will
be synchronized by our Array Director.

Many trade-offs of wireability and topology have been considered, yet with these considerations we
prefer the configuration which we illustrate with this connection. The concept disclosed has the advantage of
keeping the X and Y dimensions within a cluster level of packaging, and distributing the W and Z bus
connections to all the neighboring clusters.

After implementing the techniques described, the product will be wireable and manufacturable while
maintaining the inherent characteristics of the topology defined.

The concept used here is to mix, match, and modify topologies at different packaging levels to obtain
the desired results in terms of wire count. For the method to define the actual degree of modification of the
hypercube, refer to the Rolfe modified hypercube patent application referenced above. For the purpose of
this preferred embodiment we will describe two packaging levels to simplify our description. It can be
expanded.

The first is the chip design or chip package illustrated by FIGURE 3 and FIGURE 11. There are eight of
the processing elements with their associated memory and communication logic encompassed into a single
chip which is defined as a node. The internal configuration is classified as a binary hypercube or a
hypercube where every PME is connected to two neighbors. See the PME-PME communication diagram in
FIGURE 9, especially 500, 510, 520, 530, 540, 550, 560, 570.

The second step is that the nodes are configured as an 8 X 8 array to make up a cluster. The fully
populated machine is built up of an array of 8 X 8 clusters to provide the maximum capacity of 32768
PMEs. These 4096 nodes are connected in an 8 degree modified hypercube network where the
communication between nodes is programmable. This ability to program different routing paths adds flexibility to
transmit different length messages. In addition to message length differences, there are algorithm
optimizations that can be addressed with these programmability features.
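
The capacity figures follow directly from the two packaging levels. The short calculation below is only a restatement of that arithmetic; the constant and function names are illustrative.

```python
PMES_PER_NODE = 8            # eight PMEs on one chip (node)
NODES_PER_CLUSTER = 8 * 8    # nodes arranged as an 8 X 8 array
CLUSTERS_PER_SYSTEM = 8 * 8  # clusters arranged as an 8 X 8 array


def system_capacity() -> dict:
    """Return node and PME counts for one cluster and for the fully populated machine."""
    pmes_per_cluster = PMES_PER_NODE * NODES_PER_CLUSTER
    nodes = NODES_PER_CLUSTER * CLUSTERS_PER_SYSTEM
    return {
        "pmes_per_cluster": pmes_per_cluster,   # 512
        "nodes": nodes,                         # 4096
        "pmes": nodes * PMES_PER_NODE,          # 32768
    }
```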

The packaging concept is intended to significantly reduce the off page wire count for each of the
clusters. This concept takes a cluster, which is defined as an 8 X 8 array of nodes 820, each node 825 having
8 processing elements for a total of 512 PMEs, then limits the X and Y rings within the cluster and, finally,
brings out the W and Z buses to all clusters. The physical picture could be envisioned as a sphere
configuration 800, 810 of 64 smaller spheres 830. See FIGURE 15 for a future packaging picture which
illustrates the full up packaging technique, limiting the X and Y rings 800 within the cluster and extending
out the W and Z buses to all clusters 810.

The actual connection of a single node to the adjacent X and Y neighbors 975 exists within the same
cluster. The wiring savings occur when the Z and W buses are extended to the adjacent neighboring
clusters as illustrated in FIGURE 16. Also illustrated in FIGURE 16 is the set of the chips or nodes that can
be configured as a sparsely connected 4-dimensional hypercube or torus 900, 905, 910, 915. Consider each
of the 8 external ports to be labeled as +X, +Y, +Z, +W, -X, -Y, -Z, -W 950, 975. Then, using m chips, a
ring can be constructed by connecting the +X to -X ports. Again m such rings can be interconnected into a
ring of rings by interconnecting the matching +Y to -Y ports. This level of structure will be called a cluster.
It provides for 512 PMEs and will be the building block for several size systems. Two such connections
(950, 975) are shown in the diagram illustrated in FIGURE 16.
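
The ring and ring-of-rings construction can be expressed as modular arithmetic on node coordinates. The sketch below is illustrative only: the coordinate tuple `(x, y, w, z)`, the function name, and the assumption that each W or Z port reaches the node at the same X,Y position in the adjacent cluster are not taken from the text.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int, int]   # (x, y) inside the cluster, (w, z) cluster position


def node_neighbors(x: int, y: int, w: int, z: int, m: int = 8) -> Dict[str, Coord]:
    """Return the node reached through each of the 8 external ports.

    Every dimension is a ring of m positions, so +port and -port wrap around.
    X and Y neighbors stay inside the cluster; W and Z neighbors are in
    adjacent clusters (assumed to sit at the same x, y position there).
    """
    return {
        "+X": ((x + 1) % m, y, w, z),
        "-X": ((x - 1) % m, y, w, z),
        "+Y": (x, (y + 1) % m, w, z),
        "-Y": (x, (y - 1) % m, w, z),
        "+W": (x, y, (w + 1) % m, z),
        "-W": (x, y, (w - 1) % m, z),
        "+Z": (x, y, w, (z + 1) % m),
        "-Z": (x, y, w, (z - 1) % m),
    }
```

With m = 8, one X ring holds 8 nodes and a ring of 8 such rings holds 64 nodes, which at 8 PMEs per node gives the 512 PMEs of a cluster.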

Applications for Deskside MPP

The deskside MPP in a workstation can be effectively applied in several application areas including:

1. Small production tasks that depend upon compute intensive processes. The US Postal Service
requires a processor that can accept a fax image of a machine printed envelope and then find and read
the code. The process is needed at all regional sort facilities and is an example of a very repetitive
but still compute intensive process. We have implemented APL language versions of a sample of the
required programs. These models emulate the vector and array processes that will be used to do the
work on the MPP. Based upon this test, we know that the task is an excellent match to the processing
architecture.

2. Tasks in which an analyst, as a result of prior output or expected needs, requests sequences of data
transformations. In an example taken from the Defense Mapping Agency, satellite images are to be
transformed and smoothed pixel by pixel into some other coordinate system. In such a situation, the
transformation parameters for the image vary across localities as a result of ground elevation and slope.
The analyst must therefore add fixed control points and reprocess transformations. A similar need occurs
in the utilization of scientific simulation results when users require almost real time rotation or perspective
changes.

3. Program development for production versions of the MPP will use workstation size MPPs. Consider a
tuning process that requires analysis of processor versus network performance. Such a task is machine
and analyst interactive. It can require hours when the machine is idle and the analyst is working. When
performed on a supercomputer it would be very costly. However, providing an affordable workstation
MPP with the same (but scaled) characteristics as the supercomputer MPP eliminates costs and eases
the test and debug process by eliminating the programmer inefficiencies related to accessing remote
processors.

FIGURE 22 is a drawing of the workstation accelerator. It uses the same size enclosure as the
RISC/6000 model 530. Two swing out gates, each containing a full cluster, are shown. The combined two
clusters provide 5 GOPS of fixed point performance and 530 Mflops of processing power and about 100
Mbyte/s of I/O bandwidth to the array. The unit would be suitable for any of the prior applications. With
quantity production and including a host RISC/6000, it would be price comparable with high performance
workstations, not at the price of comparable machines employing old technology.

Description of the AWACS Sensor Fusion

The military environment provides a series of examples showing the need for a hardened compute
intensive processor.

Communication in the targeted noisy environments implies the need for digitally encoded
communications, as is used in ICNIA systems. The process of encoding the data for transmission and recovering
information after receipt is a compute intensive process. The task can be done with specialized signal
processing modules, but for situations where communication encoding represents bursts of activity,
specialized modules are mostly idle. Using the MPP permits several such tasks to be allocated to a single
module and saves weight, power, volume and cost.

Sensor data fusion presents a particularly clear example of enhancing an existing platform with the
compute power gained from the addition of MPP. On the Air Force E3 AWACS there are more than four
sensors on the platform, but there is currently no way to generate tracks resulting from the integration of all
available data. Further, the existing generated tracks have quite poor quality due to sampling characteristics.
Therefore, there is motivation to use fusion to provide an effective higher sample rate.

We have studied this sensor fusion problem in detail and can propose a verifiable and effective solution,
but that solution would overwhelm the compute power available in an AWACS data processor. FIGURE 23
shows the traditional track fusion process. The process is faulty because each of the individual processes
tends to make some errors and the final merge tends to collect them instead of eliminating them. The
process is also characterized by high time latency in that merging does not complete until the slowest
sensor completes. FIGURE 24 presents an improvement and the resulting compute intensive problem with
the approach. Although we cannot solve an NP-Hard problem, we have developed a good method to
approximate the solution. The details of that application are being described by the inventors
elsewhere, as it can be employed on a variety of machines like an Intel Touchstone with 512 i860 (80860)
processors and IBM's Scientific Visualization System. It can also be used as an application suitable for the MMP
using the APAP design described here with, say, 128,000 PMEs, substantially outperforming these other
systems. Application experiments show the approximation quality is below the level of sensor noise and as
such the answer is applicable to applications like AWACS. FIGURE 25 shows the processing loop involved
in the proposed Lagrangean Reduction n-dimensional Assignment algorithm. The problem uses very
controlled repetitions of the well known 2-dimensional assignment problem, the same algorithm that
classical sensor fusion processing uses.

Suppose for example that the n-dimensional algorithm was to be applied to the seven sets of
observations illustrated in FIGURE 24 and further, suppose that each pass through a reduction process
required four iterations through a 2d Assignment process. Then the new 6d Assignment Problem would
require 4000 iterations of the 2d Assignment Problem. The AWACS' workload is now about 90% of machine
capacity. Fusion perhaps requires 10% of the total effort, but even that small effort when scaled up 4000
times results in total utilization being 370 times the capacity of an AWACS. Not only does this workload
overwhelm the existing processor, but it would be marginal in any new MIL environment suited, coarse-
grained, parallel processing system currently existing or anticipated in the next few years. If the algorithm
required an average of 5 rather than 4 iterations per step, then it would overwhelm even the hypothesized
systems. Conversely, the MPP solution can provide the compute power and can do so even at the 6
iteration level.
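
The scaling argument can be checked with a rough calculation. The sketch below rests on assumptions that the text does not state explicitly: that the iteration count grows as (iterations per step) raised to the number of nested reduction levels, with six levels reproducing the figure of roughly 4000, and that fusion load is the fusion share of the current workload multiplied by the scale factor. The function names are illustrative.

```python
def assignment_iterations(iterations_per_step: int, levels: int) -> int:
    """Number of 2d assignment solves if each nested level repeats the one below it."""
    return iterations_per_step ** levels


def fusion_utilization(workload: float, fusion_share: float, scale: float) -> float:
    """Fusion load as a multiple of machine capacity after scaling the fusion effort."""
    return workload * fusion_share * scale


four_per_step = assignment_iterations(4, 6)    # 4096, close to the quoted 4000
five_per_step = assignment_iterations(5, 6)    # 15625, nearly four times as many
load = fusion_utilization(0.90, 0.10, 4000)    # about 360 times capacity
```

Under these assumptions the estimate lands near 360 times capacity, the same order as the 370 quoted above; the exact inputs behind the quoted figure are not given. The jump from 4 to 5 iterations per step multiplies the work by almost four, which is why a small change in iteration count decides whether a system is adequate.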

Mechanical Packaging

As illustrated in FIGURE 3, and other FIGURES, our preferred chip is configured in a quadflatpack form.
As such it can be brickwalled into various 2-D and 3-D configurations in a package. One chip of eight or
more processor memory elements is a first level package module, the same as a single DRAM memory
chip is to a foundry which packages the chip. However, it is in a quadflatpack form, allowing connections to
one another in four directions. Each connection is point to point. (One chip in its first level package is a
module to the foundry.) We are able to construct PE arrays of sufficient magnitude to hit our performance
goals due to this feature. The reality is that you can connect these chips across 3, 4 or even five feet, point-
to-point, i.e. multi-processor node to node, and still have proper control without the need of fiber optics.

This has an advantage for the drive/receive circuits that are required on the modules. One can achieve
high performance and keep the power dissipation down because we do not have bus systems that daisy
chain from module to module. We broadcast from node to node, but this need not be a high performance
path. Most data operations can be conducted in a node, so data path requirements are reduced. Our
broadcast path is primarily used as a controller routing tool. The data stream attaches to, and
runs in, the ZWXY communication path system.

Our power dissipation is 2.2 watts per node module for our commercial workstation. This allows us to
use air cooled packaging. The power system requirements for our system are also reasonable because of
this fact. Our power system illustrated multiplies the number of modules supported by about 2.5 watts per
module, and such a five volt power supply is very cost effective. Those concerned with the amount of
electricity consumed would be astonished that 32 microcomputers could operate with less than the wattage
consumed by a reading light.

Our thermal design is enhanced because of the packaging. We avoid hot spots due to high dissipating
parts mixed with low dissipating ones. This reflects directly on the cost of the assemblies.

The cost of our system is very attractive compared to the approaches that put a superscalar processor
on a card. Our performance level per assembly per watt per connector per part type per dollar is excellent.
Furthermore, we do not need the same number of packaging levels that the other technology does. We
do not need module/card/backplane and cable. We can skip the card level if we want to. As illustrated in our
workstation modules, we have skipped the card level with our brickwalled approach.

Furthermore, as we illustrated in our layout, each node housing which is brickwalled in the workstation
modules can, as illustrated in FIGURE 3, comprise multiple replicated dies, even within the same chip
housing. While normally we would place one die within an air cooled package, it is possible to place 8 die
on a substrate using a multiple chip module approach. Thus, the envisioned watch with 32 or more
processors is possible, as well as many other applications. The packaging, power and flexibility make
the possible applications endless. A house could have its controllable instruments all watched, and
coordinated with a very small part. Those many chips that are spread around an automobile for engine watching,
brake adjustment, and so on could all have a monitor within a housing. In addition, on the same substrate
with hybrid technology, one could mount a 386 microprocessor chip with full programmable capability and
memory (all in one chip) and use it as the array controller for the substrate package.

We have shown many configurations of systems, from control systems, FIGURE 3, to larger and larger
systems. The ability to package a chip with multiple processor memory elements of eight or more on a chip
in a DIP, with pinouts fitting in a standard DRAM memory module, such as in a SIMM module, makes possible
countless additional applications ranging from controls to wall size video displays which can have a
repetition rate, not at the 15 or so frames that press the existing technology today, but at 30 frames, with a
processor assigned to monitor a pixel, or a node only a few pixels. Our brickwall quadflatpack makes it easy
to replicate the same part over and over again. Furthermore, the replicated processor is really memory
with processor interchange. Part of the memory can be assigned to a specific monitoring task, and another
part (with a size programmatically defined) can be a massive global memory, addressed point-to-point, with
broadcast to all capability.

Our basic workstation, our supercomputer, our controller, our AWACS, all are examples of packages
that can employ our new technology. An array of memory, with inbuilt CPU chips and I/O, functions as a
PME of massively parallel applications, and even more limited applications. The flexibility of packaging and
programming makes imaginations expand and our technology allows one part to be assigned to many ideas
and images.

Military Avionics Applications

The cost advantage of constructing a MIL MPP is particularly well illustrated by the AWACS. It is a
20 year old enclosure that has grown empty space as new technology memory modules have replaced the
original core memories. FIGURE 28 shows a MIL qualifiable two cluster system that would fit directly into
the rack's empty space and would use the existing memory bus system for interconnection.

Although the AWACS example is very advantageous due to the existence of empty space, in other
systems it is possible to create space. Replacing existing memory with a small MPP or gateway to an
isolated MPP is normally quite viable. In such cases, a quarter cluster and an adapter module would result in
an 8 Megabyte memory plus 640 MIPS and use perhaps two slots.
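
The quarter cluster figures are consistent with the per-PME numbers given earlier in the document. The calculation below is a sketch under two assumptions: 5 MIPS per PME, as stated in the abstract, and 64 Kbytes of memory per PME, inferred from the abstract's 32K 16-bit words. The names are illustrative.

```python
PMES_PER_CLUSTER = 512
MIPS_PER_PME = 5      # from the abstract: each processor provides 5 MIPS
KBYTES_PER_PME = 64   # assumption: 32K words of 16 bits each


def resources(clusters: float) -> dict:
    """Aggregate PME count, MIPS and memory for a number (or fraction) of clusters."""
    pmes = int(PMES_PER_CLUSTER * clusters)
    return {
        "pmes": pmes,
        "mips": pmes * MIPS_PER_PME,
        "mbytes": pmes * KBYTES_PER_PME / 1024,
    }


quarter = resources(0.25)   # 128 PMEs, 640 MIPS, 8 Mbytes
full = resources(64)        # 32768 PMEs, 163840 MIPS (about 164 GIPS)
```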

Supercomputer Application 

A 64 cluster MPP is a 13.6 Gflop supercomputer. It can be configured in a system described in FIGURE
27.

The system we describe allows node chips to be brickwalled on cluster cards as illustrated in FIGURE 27
to build up systems with some significant cost and size advantages. There is no need to include extra chips
such as a network switch in such a system because it would increase costs.

Our interconnection system with "brickwalled" chips allows systems to be built like massive DRAM
memory is packaged and will have a defined bus adapter conforming to the rigid bus specifications, for
instance a MicroChannel bus adapter. Each system will have a smaller power supply system and cooling
design than other systems based upon many modern microprocessors.

Unlike most supercomputers, our current preferred APAP with floating point emulation is much faster in
integer arithmetic (164 GIPS) than it is when doing floating point arithmetic. As such, the processor would
be most effective when used in applications that are very character or integer intensive. We have
considered three program challenges which, in addition to the other applications discussed herein, are
needful of solution. The applications, which may be more important than some of the "grand challenges" to
day to day life, include:

1. 3090 Vector Processors contain a very high performance floating point arithmetic unit. That unit, as do
most vectorized floating point units, requires pipeline operations on dense vectors. Applications that
make extensive use of non-regular sparse matrices (i.e. matrices described by bit maps rather than
diagonals) waste the performance capability of the floating point unit. The MPP solves this problem by
providing the storage for the data and using its compute power and network bandwidth, not to do the
calculation but rather to construct dense vectors, and to decompress dense results. The Vector
Processing Unit is kept busy by a continual flow of operations on dense vectors being supplied to it by
the MPP. By sizing the MPP so that it can effectively compress and decompress at the same rate the
Vector Facility processes, one could keep both units fully busy.

2. Another host attached system we considered is a solution to the FBI fingerprint matching problem.
Here, a machine with more than 64 clusters was considered. The problem was to match about 6000
fingerprints per hour against the entire database of fingerprint history. Using massive DASD and the full
bandwidth of the MPP to host attachment, one can roll the complete database across the incoming
prints in about 20 minutes. Operating about 75% of the MPP in a SIMD mode coarse matching
operation balances processing to the required throughput rate. We estimate that 15% of the machine in
A-SIMD processing mode would then complete the matching by doing the detailed verification of unknown
print versus file print for cases passing the coarse filter operation. The remaining portions of the machine
were in MIMD mode and allocated to reserve capacity, work queue management and output formatting.

3. Application of the MPP to database operations has been considered. Although the work is very
preliminary, it does seem to be a good match. Two aspects of the MPP support this premise:

a. The connection between a cluster Controller and the Application Processor Interface is a
MicroChannel. As such, it could be populated with DASD dedicated to the cluster and accessed directly
from the cluster. A 64 cluster system with six 640 Mbyte hard drives attached per cluster would
provide 246 Gbyte storage. Further, that entire database could be searched sequentially in 10 to 20
seconds.

b. Databases are generally not searched sequentially. Instead they use many levels of pointers.
Indexing of databases can be done within the cluster. Each bank of DASD would be supported by 2.5
GIPS of processing power and 32 Mbyte of storage. That is sufficient for both searching and storing
the indices. Since indices are now frequently stored within the DASD, significant performance gains
would occur. Using such an approach and dispersing DASD on SCSI interfaces attached to the cluster
MicroChannel permits effectively unlimited size databases.
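
The division of labor in item 1, where the MPP builds dense vectors and a vector unit does the arithmetic, can be illustrated with bit-map compression. This is a sketch only: the function names are hypothetical, `vector_unit_multiply` stands in for the attached vector processor, and an element-wise product of two sparse rows is chosen purely as an example operation.

```python
from typing import List, Sequence, Tuple


def compress(values: Sequence[float], bitmap: Sequence[int]) -> List[float]:
    """Gather the elements whose bit is set into a dense vector."""
    return [v for v, bit in zip(values, bitmap) if bit]


def decompress(dense: Sequence[float], bitmap: Sequence[int], fill: float = 0.0) -> List[float]:
    """Scatter a dense result back to the positions marked in the bit map."""
    it = iter(dense)
    return [next(it) if bit else fill for bit in bitmap]


def vector_unit_multiply(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Stand-in for the dense pipelined operation done by the vector unit."""
    return [x * y for x, y in zip(a, b)]


def sparse_multiply(row_a, map_a, row_b, map_b) -> Tuple[List[float], List[int]]:
    """Element-wise product of two bit-mapped sparse rows."""
    both = [ma & mb for ma, mb in zip(map_a, map_b)]    # positions present in both rows
    dense_out = vector_unit_multiply(compress(row_a, both), compress(row_b, both))
    return decompress(dense_out, both), both
```

Only the `vector_unit_multiply` call touches dense data; everything around it is the compress and decompress work that the text assigns to the MPP.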
FIGURE 27 is an illustration of the APAP when used to build the system into a supercomputer scaled
MPP. The approach reverts to replicating units, but here it is enclosures containing 16 clusters that are
replicated. The particular advantage of this replication approach is that the system can be scaled to suit the
user's needs.

System Architecture 

An advantage of the system architecture which is employed in the current preferred embodiment is the
ISA system, which will be understood by many who will form a pool for programming the APAP. The PME
ISA consists of the following Data and Instruction Formats, illustrated in the Tables.

Data Formats

The basic (operand) size is the 16 bit word. In PME storage, operands are located on integral word
boundaries. In addition to the word operand size, other operand sizes are available in multiples of 16 bits to
support additional functions.

Within any of the operand lengths, the bit positions of the operand are consecutively numbered from
left to right starting with the number 0. Reference to high-order or most-significant bits always refers to the
left-most bit positions. Reference to the low-order or least-significant bits always refers to the right-most bit
positions.
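
Because bit 0 is the most-significant bit, a field given as 'bits first-last' must be converted to a shift from the right before it can be extracted. The helper below is a minimal sketch of that conversion; the name `field` and its signature are illustrative.

```python
def field(word: int, first: int, last: int, width: int = 16) -> int:
    """Extract bits first..last (inclusive) of a word, where bit 0 is the left-most bit."""
    if not (0 <= first <= last < width):
        raise ValueError("bit range outside the operand")
    length = last - first + 1
    shift = width - 1 - last          # distance of the field's last bit from the right edge
    return (word >> shift) & ((1 << length) - 1)


word = 0xA5C3
high_nibble = field(word, 0, 3)       # bits 0-3 are the left-most four bits
low_nibble = field(word, 12, 15)      # bits 12-15 are the right-most four bits
```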

Instruction Formats

The length of an instruction format may either be 16 bits or 32 bits. In PME storage, instructions must
be located on a 16 bit boundary.

The following general instruction formats are used. Normally, the first four bits of an instruction define
the operation code and are referred to as the OP bits. In some cases, additional bits are required to extend
the definition of the operation or to define unique conditions which apply to the instruction. These bits are
referred to as OPX bits.
Format Code    Operation
RR             Register to Register
DA             Direct Address
RS             Register Storage
RI             Register Immediate
SS             Storage to Storage
SPC            Special

All formats have one field in common. This field and its interpretation is:

Bits 0-3 Operation Code - This field, sometimes in conjunction with an operation code extension
field, defines the operation to be performed.

Detailed figures of the individual formats along with interpretations of their fields are provided in the
following subsections. For some instructions, two formats may be combined to form variations on the
instruction. These primarily involve the addressing mode for the instruction. As an example, a storage to
storage instruction may have a form which involves direct addressing or register addressing.

RR Format

The Register-Register (RR) format provides two general register addresses and is 16 bits in length as
shown.

Field:   OP     RA     0 0 0 0    RB
Bits:    0-3    4-7    8-11       12-15



In addition to an Operation Code field, the RR format contains:

Bits 4-7 Register Address 1 - The RA field is used to specify which of the 16 general registers is
to be used as an operand and/or destination.

Bits 8-11 Zeros - Bit 8 being a zero defines the format to be a RR or DA format, and bits 9-11
equal to zero define the operation to be a register to register operation (a special case of
the Direct Address format).

Bits 12-15 Register Address 2 - The RB field is used to specify which of the 16 general registers is
to be used as an operand.



DA Format

The Direct Address (DA) format provides one general register address and one direct storage address
as shown.
Field:   OP     RA     0    DIR ADDR
Bits:    0-3    4-7    8    9-15



In addition to an Operation Code field, the DA format contains:

Bits 4-7 Register Address 1 - The RA field is used to specify which of the 16 general registers is
to be used as an operand and/or destination.

Bit 8 Zero - This bit being zero defines the operation to be a direct address operation or a
register to register operation.

Bits 9-15 Direct Storage Address - The Direct Storage Address field is used as an address into
the level unique storage block or the common storage block. Bits 9-11 of the direct
address field must be non-zero to define the direct address form.

RS Format 

The Register Storage (RS) format provides one general register address and an indirect storage
address.



Field:   OP     RA     1    DEL     RB
Bits:    0-3    4-7    8    9-11    12-15



In addition to an Operation Code field, the RS format contains:

Bits 4-7 Register Address 1 - The RA field is used to specify which of the 16 general registers is
to be used as an operand and/or destination.

Bit 8 One - This bit being one defines the operation to be a register storage operation.

Bits 9-11 Register Data - These bits are considered a signed value which is used to modify the
contents of the register specified by the RB field.

Bits 12-15 Register Address 2 - The RB field is used to specify which of the 16 general registers is
to be used as a storage address for an operand.
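
The field rules for the three 16 bit formats described so far are enough to classify an instruction word: bit 8 selects RS, and otherwise bits 9-11 separate RR (all zero) from DA (non-zero). The decoder below is a sketch under assumptions: the function names and the returned dictionary layout are illustrative, and the signed 3 bit DEL field is assumed to be two's complement, which the text does not state.

```python
def bits(word: int, first: int, last: int) -> int:
    """Extract bits first..last of a 16 bit word, with bit 0 as the left-most bit."""
    return (word >> (15 - last)) & ((1 << (last - first + 1)) - 1)


def decode16(word: int) -> dict:
    """Classify a 16 bit instruction word as RR, DA or RS and split out its fields."""
    decoded = {"op": bits(word, 0, 3), "ra": bits(word, 4, 7)}
    if bits(word, 8, 8) == 1:
        raw = bits(word, 9, 11)
        # Assumption: DEL is a 3 bit two's complement value in the range -4..3.
        decoded.update(format="RS", delta=raw - 8 if raw & 0b100 else raw,
                       rb=bits(word, 12, 15))
    elif bits(word, 9, 11) == 0:
        decoded.update(format="RR", rb=bits(word, 12, 15))
    else:
        decoded.update(format="DA", address=bits(word, 9, 15))
    return decoded


rr = decode16(0x1203)   # OP=1, RA=2, bits 8-11 zero, RB=3
da = decode16(0x1215)   # OP=1, RA=2, bit 8 zero, direct address 0b0010101
rs = decode16(0x12F4)   # OP=1, RA=2, bit 8 one, DEL=0b111, RB=4
```

A consequence of the DA rule is that direct addresses whose top three bits are zero cannot be encoded, since that bit pattern is taken by the RR form.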

RI Format

The Register-Immediate (RI) format provides one general register address and 16 bits of immediate
data. The RI format is 32 bits in length as shown:
Field:   OP     RA     1    DEL     0 0 0 0    IMMEDIATE DATA
Bits:    0-3    4-7    8    9-11    12-15      16-31



m addition lo an Operation Coda field, the Rl tor mat contains: 
Bits 4-7 



20 



{9-11 



1 12-16 



18-31 



29 S$ Format 



Register Address 1 * The RA field is used 10 specify which of the 10 general registers 1$ 
to be used as an operand and.br destination. 

One • This bit being one deflnec the operation lo bo a register storage operatton* 
Register Data - These bits ere considered a signed value which is used to modify the 
contents of the program counter. Normally.thls field wcufd have a vahie of one tor the 
register immediate format 

zeroes - The beinQ zero i* used lo specify *at the updated program counter, which 
points to the immediara data Held, is to be used es en storage address for an operand. 
Immediate Data * This flew serves as a 16 bit Immediate data operand for Register 
Immediate instiuciions. 



The Storage to Storage (SS) format provides two storage addresses, one explicit and the second implicit. The implied storage address is contained in General Register 1. Register 1 is modified during execution of the instruction. There are two forms of an SS instruction, a direct address form and a storage address form.

    (Direct Address Form)
    |  OP  |  OPX  | 0 |  DIR ADDR  |
     0    3 4     7  8  9         15

    (Storage Address Form)
    |  OP  |  OPX  | 1 | DEL |  RB  |
     0    3 4     7  8  9  11 12  15

In addition to an Operation Code field, the SS format contains:

Bits 4-7    Operation Extension Code - The OPX field, together with the Operation Code, defines the operation to be performed. Bits 4-5 define the operation type, such as ADD or SUBTRACT. Bits 6-7 control the carry, overflow, and how the condition code will be set. Bit 6 = 0 ignores overflow; bit 6 = 1 allows overflow. Bit 7 = 0 ignores the carry stat during the operation; bit 7 = 1 includes the carry stat during the operation.
Bit 8       Zero - Defines the form to be a direct address form. One - Defines the form to be a storage address form.
Bits 9-15   Direct Address (Direct Address Form) - The Direct Storage Address field is used as an address into the level unique storage block or the common storage block. Bits 9-11 of the direct address field must be non-zero to define the direct address form.
Bits 9-11   Register Delta (Storage Address Form) - These bits are considered a signed value which is used to modify the contents of the register specified by the RB field.
Bits 12-15  Register Address 2 (Storage Address Form) - The RB field is used to specify which of the 16 general registers is to be used as a storage address for an operand.
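The OPX control bits and the two SS forms can be illustrated with a short decoder. The function name `decode_ss` and the dictionary keys are hypothetical. The sketch assumes bit 0 is the most significant bit and treats the Register Delta as a two's complement value; neither is stated explicitly in the text.

```python
def decode_ss(word: int) -> dict:
    """Split a 16-bit SS-format word into its fields (bit 0 = MSB, assumed)."""
    if word not in range(0x10000):
        raise ValueError("SS instructions are 16 bits wide")
    opx = (word >> 8) & 0xF                    # bits 4-7
    fields = {
        "op": (word >> 12) & 0xF,              # bits 0-3
        "op_type": opx >> 2,                   # bits 4-5: e.g. ADD or SUBTRACT
        "allow_overflow": bool((opx >> 1) & 1),  # bit 6
        "include_carry": bool(opx & 1),        # bit 7
    }
    if (word >> 7) & 0x1 == 0:                 # bit 8 = 0: direct address form
        fields["form"] = "direct"
        fields["dir_addr"] = word & 0x7F       # bits 9-15
    else:                                      # bit 8 = 1: storage address form
        raw = (word >> 4) & 0x7                # bits 9-11
        fields["form"] = "storage"
        fields["delta"] = (raw & 0x3) - (raw & 0x4)  # two's complement (assumed)
        fields["rb"] = word & 0xF              # bits 12-15
    return fields


if __name__ == "__main__":
    print(decode_ss(0x2793))
    print(decode_ss(0x2820))
```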



SPC Format 1

The Special (SPC1) format provides one general register storage operand address.

    (RR Form)
    |  OP  |  OPX  | 0 | LEN |  RB  |
     0    3 4     7  8  9  11 12  15

    (RS Form)
    |  OP  |  OPX  | 1 | LEN |  RB  |
     0    3 4     7  8  9  11 12  15

In addition to an Operation Code field, the SPC1 format contains:

Bits 4-7    OP Extension - The OPX field is used to extend the operation code.
Bit 8       Zero or One - This bit being zero defines the operation to be a register operation. This bit being one defines the operation to be a register storage operation.
Bits 9-11   Operation Length - These bits are considered an unsigned value which is used to specify the length of the operand in 16 bit words. A value of zero corresponds to a length of one, and a value of B'111' corresponds to a length of eight.
Bits 12-15  Register Address 2 - The RB field is used to specify which of the 16 general registers is to be used as a storage address for the operand.

SPC Format 2


The Special (SPC2) format provides one general register storage operand address.

    |  OP  |  RA  |  OPX  |  RB  |
     0    3 4    7 8    11 12  15

In addition to an Operation Code field, the SPC2 format contains:

Bits 4-7    Register Address 1 - The RA field is used to specify which of the 16 general registers is to be used as an operand and/or destination.
Bits 8-11   OP Extension - The OPX field is used to extend the operation code.
Bits 12-15  Register Address 2 - The RB field is used to specify which of the 16 general registers is to be used as a storage address for the operand.
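The two special formats can be sketched side by side. In the SPC1 format the 3-bit LEN field encodes lengths of one to eight 16 bit words, so the operand length is the field value plus one. The function names `decode_spc1` and `decode_spc2` are hypothetical, and the sketch assumes bit 0 is the most significant bit of the word.

```python
def decode_spc1(word: int) -> dict:
    """Split a 16-bit SPC1-format word into its fields (bit 0 = MSB, assumed)."""
    if word not in range(0x10000):
        raise ValueError("SPC1 instructions are 16 bits wide")
    return {
        "op": (word >> 12) & 0xF,                      # bits 0-3
        "opx": (word >> 8) & 0xF,                      # bits 4-7
        # bit 8: zero = register operation, one = register storage operation
        "form": "RS" if (word >> 7) & 0x1 else "RR",
        "length_words": ((word >> 4) & 0x7) + 1,       # bits 9-11: 0 -> 1, 7 -> 8
        "rb": word & 0xF,                              # bits 12-15
    }


def decode_spc2(word: int) -> dict:
    """Split a 16-bit SPC2-format word into its fields (bit 0 = MSB, assumed)."""
    if word not in range(0x10000):
        raise ValueError("SPC2 instructions are 16 bits wide")
    return {
        "op": (word >> 12) & 0xF,   # bits 0-3
        "ra": (word >> 8) & 0xF,    # bits 4-7
        "opx": (word >> 4) & 0xF,   # bits 8-11
        "rb": word & 0xF,           # bits 12-15
    }


if __name__ == "__main__":
    print(decode_spc1(0x41F9))
    print(decode_spc2(0x4A3C))
```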

THE INSTRUCTION LIST OF THE ISA INCLUDES THE FOLLOWING: 



Table 1. Fixed-Point Arithmetic Instructions

NAME                                      MNEMONIC  TYPE
ADD DIRECT                                ada       DA
ADD FROM STORAGE                          a         RS
  (WITH DELTA)                                      RS
ADD IMMEDIATE                             ai        RI
  (WITH DELTA)                            aiwd      RI
ADD REGISTER                              ar        RR
COMPARE DIRECT ADDRESS                    cda       DA
COMPARE IMMEDIATE                         ci        RI
  (WITH DELTA)                            ciwd      RI
COMPARE FROM STORAGE                      c         RS
  (WITH DELTA)                            cwd       RS
COMPARE REGISTER                          cr        RR
COPY                                      cpy       RS
  (WITH DELTA)                            cpywd     RS
COPY WITH BOTH IMMEDIATE                  cpybi     RI
  (WITH DELTA)                            cpybiwd   RI
COPY IMMEDIATE                            cpyi      RI
  (WITH DELTA)                            cpyiwd    RI
COPY DIRECT                               cpyda     DA
COPY DIRECT IMMEDIATE                     cpydai    DA
INCREMENT                                 inc       RS
  (WITH DELTA)                            incwd     RS
LOAD DIRECT                               lda       DA
LOAD FROM STORAGE                         l         RS
  (WITH DELTA)                            lwd       RS
LOAD IMMEDIATE                            li        RI
  (WITH DELTA)                            liwd      RI
LOAD REGISTER                             lr        RR
MULTIPLY SIGNED                           mpy       SPC
MULTIPLY SIGNED EXTENDED                  mpyx      SPC
MULTIPLY SIGNED EXTENDED IMMEDIATE        mpyxi     SPC
MULTIPLY SIGNED IMMEDIATE                 mpyi      SPC
MULTIPLY UNSIGNED                         mpyu      SPC
MULTIPLY UNSIGNED EXTENDED                mpyux     SPC
MULTIPLY UNSIGNED EXTENDED IMMEDIATE      mpyuxi    SPC
MULTIPLY UNSIGNED IMMEDIATE               mpyui     SPC
STORE DIRECT                              stda      DA
STORE                                     st        RS
  (WITH DELTA)                            stwd      RS
STORE IMMEDIATE                           sti       RI
  (WITH DELTA)                            stiwd     RI
SUBTRACT DIRECT                           sda       DA
SUBTRACT FROM STORAGE                     s         RS
  (WITH DELTA)                            swd       RS
SUBTRACT IMMEDIATE                        si        RI
  (WITH DELTA)                                      RI
SUBTRACT REGISTER                         sr        RR
SWAP AND EXCLUSIVE OR WITH STORAGE                  RR

Table 2. Storage to Storage Instructions

NAME                                                MNEMONIC  TYPE
ADD STORAGE TO STORAGE                              sa        SS
  (WITH DELTA)                                      sawd      SS
ADD STORAGE TO STORAGE DIRECT                       sada      SS
ADD STORAGE TO STORAGE FINAL                        saf       SS
  (WITH DELTA)                                      safwd     SS
ADD STORAGE TO STORAGE FINAL DIRECT                 safda     SS
ADD STORAGE TO STORAGE INTERMEDIATE                 sai       SS
  (WITH DELTA)                                      saiwd     SS
ADD STORAGE TO STORAGE INTERMEDIATE DIRECT          saida     SS
ADD STORAGE TO STORAGE LOGICAL                      sal       SS
  (WITH DELTA)                                      salwd     SS
ADD STORAGE TO STORAGE LOGICAL DIRECT               salda     SS
COMPARE STORAGE TO STORAGE                          sc        SS
  (WITH DELTA)                                      scwd      SS
COMPARE STORAGE TO STORAGE DIRECT                   scda      SS
COMPARE STORAGE TO STORAGE FINAL                    scf       SS
  (WITH DELTA)                                      scfwd     SS
COMPARE STORAGE TO STORAGE FINAL DIRECT             scfda     SS
COMPARE STORAGE TO STORAGE INTERMEDIATE             sci       SS
  (WITH DELTA)                                      sciwd     SS
COMPARE STORAGE TO STORAGE INTERMEDIATE DIRECT      scida     SS
COMPARE STORAGE TO STORAGE LOGICAL                  scl       SS
  (WITH DELTA)                                      sclwd     SS
COMPARE STORAGE TO STORAGE LOGICAL DIRECT           sclda     SS
MOVE STORAGE TO STORAGE                             smov      SS
  (WITH DELTA)                                      smovwd    SS
MOVE STORAGE TO STORAGE DIRECT                      smovda    SS
SUBTRACT STORAGE TO STORAGE                         ss        SS
  (WITH DELTA)                                      sswd      SS
SUBTRACT STORAGE TO STORAGE DIRECT                  ssda      SS
SUBTRACT STORAGE TO STORAGE FINAL                             SS
  (WITH DELTA)                                      ssfwd     SS
SUBTRACT STORAGE TO STORAGE FINAL DIRECT            ssfda     SS
SUBTRACT STORAGE TO STORAGE INTERMEDIATE            ssi       SS
  (WITH DELTA)                                      ssiwd     SS
SUBTRACT STORAGE TO STORAGE INTERMEDIATE DIRECT     ssida     SS
SUBTRACT STORAGE TO STORAGE LOGICAL                 ssl       SS
  (WITH DELTA)                                      sslwd     SS
SUBTRACT STORAGE TO STORAGE LOGICAL DIRECT          sslda     SS



Table 3. Logical Instructions

NAME                                      MNEMONIC  TYPE
AND DIRECT ADDRESS                        nda       DA
AND FROM STORAGE                          n         RS
  (WITH DELTA)                            nwd       RS
AND IMMEDIATE                             ni        RI
  (WITH DELTA)                                      RI
AND REGISTER                              nr        RR
OR DIRECT ADDRESS                         oda       DA
OR FROM STORAGE                           o         RS
  (WITH DELTA)                            owd       RS
OR IMMEDIATE                              oi        RI
  (WITH DELTA)                            oiwd      RI
OR REGISTER                               or        RR
XOR DIRECT ADDRESS                                  DA
XOR FROM STORAGE                          x         RS
  (WITH DELTA)                            xwd       RS
XOR IMMEDIATE                             xi        RI
  (WITH DELTA)                            xiwd      RI
XOR REGISTER                              xr        RR

Table 4. Shift Instructions

NAME                                                MNEMONIC  TYPE
SCALE BINARY                                        scale     SPC
SCALE BINARY IMMEDIATE                              scalei    SPC
SCALE BINARY REGISTER                               scaler    SPC
SCALE HEXADECIMAL                                   scaleh    SPC
SCALE HEXADECIMAL IMMEDIATE                         scalehi   SPC
SCALE HEXADECIMAL REGISTER                          scalehr   SPC
SHIFT LEFT ARITHMETIC BINARY                        sla       SPC
SHIFT LEFT ARITHMETIC BINARY IMMEDIATE              slai      SPC
SHIFT LEFT ARITHMETIC BINARY REGISTER               slar      SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL                   slah      SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL IMMEDIATE         slahi     SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL REGISTER          slahr     SPC
SHIFT LEFT LOGICAL BINARY                           sll       SPC
SHIFT LEFT LOGICAL BINARY IMMEDIATE                 slli      SPC
SHIFT LEFT LOGICAL BINARY REGISTER                  sllr      SPC
SHIFT LEFT LOGICAL HEXADECIMAL                      sllh      SPC
SHIFT LEFT LOGICAL HEXADECIMAL IMMEDIATE            sllhi     SPC
SHIFT LEFT LOGICAL HEXADECIMAL REGISTER             sllhr     SPC
SHIFT RIGHT ARITHMETIC BINARY                       sra       SPC
SHIFT RIGHT ARITHMETIC BINARY IMMEDIATE             srai      SPC
SHIFT RIGHT ARITHMETIC BINARY REGISTER              srar      SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL                  srah      SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL IMMEDIATE        srahi     SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL REGISTER         srahr     SPC
SHIFT RIGHT LOGICAL BINARY                          srl       SPC
SHIFT RIGHT LOGICAL BINARY IMMEDIATE                srli      SPC
SHIFT RIGHT LOGICAL BINARY REGISTER                 srlr      SPC
SHIFT RIGHT LOGICAL HEXADECIMAL                     srlh      SPC
SHIFT RIGHT LOGICAL HEXADECIMAL IMMEDIATE           srlhi     SPC
SHIFT RIGHT LOGICAL HEXADECIMAL REGISTER            srlhr     SPC



Table 5. Branch Instructions

NAME                                      MNEMONIC  TYPE
BRANCH                                    b         RS
  (WITH DELTA)                            bwd       RS
BRANCH DIRECT                             bda       DA
BRANCH IMMEDIATE                          bi        RI
  (WITH DELTA)                            biwd      RI
BRANCH REGISTER                           br        RS
BRANCH AND LINK                           bal       RS
BRANCH AND LINK DIRECT                    balda     DA
BRANCH AND LINK IMMEDIATE                 bali      RI
  (WITH DELTA)                            baliwd    RI
BRANCH AND LINK REGISTER                  balr      RS
BRANCH BACKWARD                           bb        RS
  (WITH DELTA)                            bbwd      RS
BRANCH BACKWARD DIRECT                    bbda      DA
BRANCH BACKWARD IMMEDIATE                 bbi       RI
  (WITH DELTA)                            bbiwd     RI
BRANCH BACKWARD REGISTER                  bbr       RS
BRANCH FORWARD                            bf        RS
  (WITH DELTA)                            bfwd      RS
BRANCH FORWARD DIRECT                     bfda      DA
BRANCH FORWARD IMMEDIATE                  bfi       RI
  (WITH DELTA)                            bfiwd     RI
BRANCH FORWARD REGISTER                   bfr       RS
BRANCH ON CONDITION                       bc        RS
  (WITH DELTA)                            bcwd      RS
BRANCH ON CONDITION DIRECT                bcda      RS
BRANCH ON CONDITION IMMEDIATE             bci       RI
  (WITH DELTA)                            bciwd     RI
BRANCH ON CONDITION REGISTER              bcr       RS
BRANCH RELATIVE                           brel      RI
  (WITH DELTA)                            brelwd    RS
NULL OPERATION                            noop      RR

Table 6. Status Switching Instructions

NAME                                      MNEMONIC  TYPE
RETURN                                    ret       SPC

Table 7. Input/Output Instructions

NAME                                      MNEMONIC  TYPE
IN                                        IN        SPC
OUT                                       OUT       SPC
INTERNAL DIOR/DIOW                        INTR      SPC

SOME SUMMARY FEATURES
The APAP Machine In Perspective

The machine we have described in accordance with our invention could be thought of, in its more detailed aspects, as positioned in the technology somewhere between the CM-1 and N-cube. Like our APAP, the CM-1 uses a point design for the processing element and combines processing elements with memory on the basic chip. The CM-1, however, uses a 1 bit wide serial processor, while the APAP series will use a 16 bit wide processor. The CM series of machines started with 4K bits of memory per processor and has grown to 8 or 16K bits, versus the 32K by 16 bits we have provided for the first APAP chip. The CM-1 and its follow-ons are strictly SIMD machines while the CM-5 is a hybrid. Instead of this, our APAP will effectively use MIMD operating modes in conjunction with SIMD modes when useful. While our parallel 16 bit wide PMEs might be viewed as a step toward the N-cube, such a view is not warranted. The APAP does not separate memory and routing from the processing element as does the N-cube kind of machine. Also, the APAP provides for up to 32K 16 bit PMEs while the N-cube only provides for 4K 32 bit processors.

Even with the superficial similarities presented above, the APAP concept completely differs from the CM and N-cube series by:

1. The modified hypercube incorporated in our APAP is a new invention providing a significant packaging and addressing advantage when compared with hypercube topologies. For instance, consider that the 32K PME APAP in its first preferred embodiment has a network diameter of 19 logical steps and, with transparency, this can be reduced to an effective 16 logical steps. Further, by comparison, if a pure hypercube were used, and if all PMEs were sending data through an 8 step path, then on average 2 of every 8 PMEs would be active while the remainder would be delayed due to blockage.

Alternatively, consider the 64K hypercube that would be needed if CM-1 was a pure hypercube. In that case, each PME would require ports to 16 other PMEs, and data could be routed between the two farthest separated PMEs in 16 logical steps. If all PMEs tried to transfer an average distance of 7 steps, then 2 of every 7 would be active. However, CM-1 does not utilize a 16d hypercube. It interconnects the 16 nodes on a chip with a NEWS network; then it provides one router function within the chip. The 4096 routers are connected into a 12d hypercube. With no collisions the hybrid still has a logical diameter of 15, but since 16 PMEs could be contending for the link its effective diameter is much greater. That is, with 8 step moves only 2 of 16 PMEs could be active, which means that 8 complete cycles rather than 4 cycles are needed to complete all data moves.

The N-cube actually utilizes a pure hypercube, but currently only provides for 4096 PMEs and thus utilizes a 12d (13d for 8192 PMEs) hypercube. For the N-cube to grow to 16K processors, at which point it would have the same processing data width as the APAP, it would have to add four times as much hardware and would have to increase the connection ports to each PME router by 25%. Although no hard data exists to support this conclusion, it would appear that the N-cube architecture runs out of connector pins prior to reaching a 16K PME machine.

2. The completely integrated and distributed nature of major tasks within the APAP machine is a decided advantage. As was noted for the CM and N-cube series of machines, each had to have separate units for message routing as well as separate units for floating point coprocessors. The APAP system combines the integer, floating point processing, message routing and I/O control into the single point design PME. That design is then repeated 8 times on a chip, and the chip is then replicated 4K times to produce the array. This provides several advantages:

a. Using one chip means maximum size production runs and minimal system factor costs.

b. Regular architecture produces the most effective programming systems.

c. Almost all chip pins can be dedicated to the generic problem of interprocessor communication, maximizing the inter-chip I/O bandwidth, which tends to be an important limiting factor in MPP designs.

3. The APAP has the unique design ability to take advantage of chip technology gains and capital investment in custom chip designs.

Consider the question of floating point performance. It is anticipated that APAP PME performance on DAXPY will be about 125 cycles per flop. In contrast, the '387 Coprocessor would be about 14 cycles while the Weitek Coprocessor in the CM-1 would be about 8 cycles. However, in the CM case there is only one floating point unit for every 16 PMEs, while in the N-cube case there is probably one '387 type chip associated with each of the '386 processors. Our APAP has 16 times as many PMEs and therefore can almost completely make up for the single unit performance delta.

More significantly, the 8 APAP PMEs within a chip are constructed from 50K gates currently available in the technology. As memory macros shrink and the number of gates available to the logic increases, spending that increase on enhanced floating point normalization should permit APAP floating point performance to far exceed the other units. Alternatively, effort could be spent to generate a PME or PME subsection design using custom design approaches, enhancing total performance while in no way affecting any S/W developed for the machine.

We believe our design for our APAP has characteristics poised to take advantage of future process technology growth. In contrast, the nearest similar machines, CM-x and N-cube, which employ a system like that described in FIGURE 1, seem well poised to take advantage of yesterday's technology, which we feel is dead ended.
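The sizing arithmetic behind the three points above can be checked with a few lines of code. This is a rough sketch only: the function names are hypothetical, the hypercube dimension is simply log2 of the node count, and the floating point comparison ignores clock rates and memory effects, which the text does not give.

```python
def hypercube_dimension(nodes: int) -> int:
    """Dimension (and ports per node) of a pure hypercube with `nodes` nodes."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("a pure hypercube needs a power-of-two node count")
    return nodes.bit_length() - 1


def single_unit_delta(apap_cycles: float, other_cycles: float) -> float:
    """How many times slower one APAP PME is per flop than another unit."""
    return apap_cycles / other_cycles


if __name__ == "__main__":
    # Hypercube sizes quoted in the text.
    print(hypercube_dimension(64 * 1024))   # 64K-node pure hypercube
    print(hypercube_dimension(4096))        # N-cube at 4096 PMEs
    print(hypercube_dimension(8192))        # N-cube at 8192 PMEs
    # 8 PMEs per chip replicated 4K times.
    print(8 * 4096)
    # Cycle counts per flop quoted in the text: 125 (APAP), 14 ('387), 8 (Weitek).
    print(single_unit_delta(125, 8))
    print(single_unit_delta(125, 14))
```

Under these simplifications, the ratio of 125 to 8 cycles is close to 16, which is consistent with the statement that having 16 times as many PMEs almost completely makes up for the single unit delta.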
An advantage of the APAP concept is the ability to use DASD associated with groups of PMEs. This APAP capability, as well as the ability to connect displays and auxiliary storage, is a by-product of picking MC bus structures as the interface to the external I/O ports of the PME Array. Thus, APAP systems will be configurable and can include card mounted hard drives selected from one of the set of units that are compatible with PS/2 or RISC/6000 units. Further, that capability should be available without designing any additional part number modules, although it does require utilizing more replications of the backpanel and base enclosure than does the APAP.

This brief perspective is not intended to be limiting, but rather is intended to cause those skilled in the art to review the foregoing description and examine how the many inventions we have described may be used to move the art of massively parallel systems ahead to a time when programming is no longer a significant problem and the costs of such systems are much lower. Our kind of system can be made available, not only to the few, but to many, as it could be made at a cost within the reach of commercial department level procurements.

While we have described our preferred embodiments of our invention, it will be understood that those skilled in the art, both now and in the future, upon the understanding of these discussions will make various improvements and enhancements thereto which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first disclosed.

Claims

1. A computer system comprising a plurality of multi-processor memory elements, each having communication paths, processor and memory, and wherein a programmable router is provided for routing data and control information from one multi-processor memory element to another multi-processor memory element and between nodes of the computer system.

2. A computer system according to claim 1 wherein each multi-processor memory element (PME) has 2n processors, and communication paths which minimize delays due to chip crossings.

3. A computer system according to claim 1 wherein each multi-processor memory element (PME) has a processor, memory and routers within a single chip and internal and external communication paths which minimize delays due to chip crossings, each processor memory element having means for fixed and floating point processing, routing and I/O control.

4. A computer system according to claim 1 further comprising within a processor memory element: a native instruction set means for providing an expandable multiply function, a programmable router for routing information alternatively left-right, NEWS matrix, NEWS-up-down, hypercube, and wherein said programmable router employs a hardwired distributed router provided by each processing memory element.

5. A computer system according to claim 1 organized as a massively parallel machine with nodes interconnected as an n dimensional network cluster with parallel communication paths between processor memory elements along said internal and external communication paths providing a processing array, and wherein processing memory elements of an array have a transparent mode utilized when routing data between processing memory elements within a chip set of processing memory elements for permitting reduction of the effective network diameter of a network of nodes.

6. A computer system according to claim 1 wherein a node of a processor array has multiple single processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight processors with their associated memory with their fully distributed I/O routers and signal I/O ports.

7. A computer system according to claim 1 wherein a node of a processor array has multiple single processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight processors with their associated memory with their fully distributed I/O routers and signal I/O ports combined as groups of node clusters organized as a 2d modified hypercube.

8. A computer system according to claim 1 wherein a node of a processor array has multiple single processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight processors with their associated memory with their fully distributed I/O routers and signal I/O ports combined as groups of node clusters organized as a 2d modified hypercube, with up to 64 clusters integrated in a network of node clusters to form a 4d modified hypercube of up to 32,768 processing memory elements.



EP 0 570 729 A2



9. A computer system according to claim 1 wherein a node processing memory element has internal data flows using high speed hard registers to feed distributed ALU and I/O router registers and logic for all operations.

10. A computer system according to claim 1 wherein a node processing memory element has an I/O port for on chip byte wide communication, and has input ports that are connected such that data may be routed from input to memory, or from an input address register to an output register via a direct parallel data path.

11. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with data routing distributed between hardware and software, with software controlling most of the task sequencing function.

12. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with data routing distributed between hardware and software, with hardware provided for performing inner loop transfers and minimizing overhead on the outer loops of the node.

13. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with I/O programs at dedicated interrupt levels for managing the network.

14. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with I/O programs at dedicated interrupt levels for managing the network, each processor memory element having interrupt registers and dedicating four interrupt levels to receiving data from four neighbors, a buffer provided at each level by loading registers at the level, and having in and return instruction pairs using a buffer address and transfer count to enable the processor memory element to accept words from an input bus and to store them to the buffer.

15. A multi-processor memory system, comprising: a plurality of multi-processor memory elements, each multi-processor memory element (PME) having 2n processors, memory and routers within a single chip and internal and external communication paths which minimize delays due to chip crossings.
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FIG.1A 

Prior Art 




FIG.1B (Prior Art). Block labels: CONFIGURATION REGISTER & TIMING CONTROL; EXTERNAL MEMORY INTERFACE.




FIG.5. Block labels: APPLICATION PROCESSOR 0, 1, 2 through N; ARRAY DIRECTOR; APPLICATION PROCESSOR INTERFACE; ARRAY CONTROLLER & SYNCHRONIZER; TEST/DEBUG DEVICE; ARRAY 00, 01, 02 through N. Reference numerals 200 through 310.



FIG.19. Block labels: HOST APPLICATION PROCESSOR; USER PROGRAMS; COMPILE, EXECUTE; PERFORMANCE & DEMO PROGRAMS; APPLICATION DEV LIB; RISC/6000 TEST & DEBUG MONITOR; DEBUGGER; PERFORMANCE MONITOR & ANALYSIS; DIAGNOSTICS; SIMULATOR; ASSEMBLER/LINKER & LOADER; TEST & INTERFACE MONITOR; ARRAY CTRLR; HOST I'FACE; MONITOR I/F; ARRAY I/F; RS/6000 MPP SIMULATOR.



FIG.11. Block labels: EXTERNAL PORTS; CMP/INST; eight PEs labeled by direction (+w, +x, +y, +z, -w, -x, -y, -z); B'CAST I'FACE; SERIAL LOOPS; INTER PE BUS; B'CST BUS.



FIG.12. Block labels: ARRAY DIRECTOR; APPLICATION PROCESSOR INTERFACE; CLUSTER CONTROLLER; CLUSTER SYNCHRONIZER; FAST I/O (ZIPPER); BCI; ARRAY CLUSTERS. Reference numerals 605 through 690.



FIG.13A. Labels: SYSTEM BUS; CLKS; 8X8 NODE CLUSTER WITH x AND y PATHS DOTTED, WITH SYSTEM BUS AT EDGE; BI-DIRECTIONAL TRISTATE DRIVERS (CONTROLLER ENABLE); SYSTEM BUS DRIVER; PE x0 through PE x15, each with MEM.



HOW WOULD A 16 ELEMENT SORT REPEAT THE PATTERN?

FOR SORTING n DATA ELEMENTS (n in {2^i : i in N, 2^i <= # OF PEs}):

    do I = 0 to (log2 n) - 1
      do J = 0 to I
        if (PE# / 2^(I-J)) mod 2 = 0
          then TARGET = PE# + 2^(I-J)
          else TARGET = PE# - 2^(I-J)
        send DATA to TARGET
        receive data, store in TEMP (if data is not available - wait)
        if ( ... ) = 0
          then if TEMP < DATA then DATA = TEMP else NOP
          else if TEMP > DATA then DATA = TEMP else NOP
    end both do's

FIG.17
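The exchange pattern of FIG.17 (partner at distance 2^(I-J), send, receive, then keep the smaller or larger value) has the shape of a bitonic sorting network. The sketch below simulates that pattern sequentially for n PEs. The function name `bitonic_sort_pes` is hypothetical, and because the condition that selects between the two compare branches is not legible in the figure, the sketch substitutes the standard bitonic direction rule; that rule is an assumption, not the figure's own expression.

```python
def bitonic_sort_pes(data):
    """Simulate the FIG.17 exchange pattern for n = 2^k PEs, one value per PE."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("number of elements must be a power of two")
    data = list(data)
    stages = n.bit_length() - 1           # log2 n
    for i in range(stages):               # do I = 0 to (log2 n) - 1
        for j in range(i + 1):            # do J = 0 to I
            d = 2 ** (i - j)
            # Exchange phase: every PE sends DATA to TARGET; TEMP is what it receives.
            temp = [None] * n
            for pe in range(n):
                lower = (pe // d) % 2 == 0
                target = pe + d if lower else pe - d
                temp[target] = data[pe]
            # Compare phase. ASSUMPTION: standard bitonic rule, where the
            # sort direction of a block is given by bit (I+1) of the PE number.
            for pe in range(n):
                lower = (pe // d) % 2 == 0
                ascending = (pe >> (i + 1)) % 2 == 0
                if lower == ascending:
                    data[pe] = min(data[pe], temp[pe])   # keep the smaller value
                else:
                    data[pe] = max(data[pe], temp[pe])   # keep the larger value
    return data


if __name__ == "__main__":
    print(bitonic_sort_pes([3, 1, 2, 0]))
```

For a 16 element sort the outer loop runs four times and the inner loop runs 1, 2, 3 and 4 times, for ten exchange steps in total.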



FIG.18. Block labels: DATA AND COMMANDS; APPLICATION PROCESSOR INTERFACE (API); COMMAND, DATA; ARRAY CONTROLLER; CLUSTER SYNCHRONIZER (CS); OPTIONAL u-CHANNEL DEVICES (DASD, DISPLAYS, GATEWAYS, ETC.); DATA (u-CHANNEL); CLUSTER CONTROLLER 0 (CC, PORT TYPE); DATA PORT; NORMAL BROADCAST & STATUS; CLUSTER 0, 64 NODES (512 PMEs); CLUSTER CONTROLLER 1 through N (CC, NON-PORTED); STATUS MONITOR; B'CAST; SERIAL LOOP; CLUSTER 1 through CLUSTER M; PME ARRAY.



FIG.20. Block labels: APPLICATION DEVELOPER; APPLICATION OPTIMIZER; CUSTOMIZER OR PROD. DEV'R; VECTORIZED HIGH ORDER LANG SOURCE; APPLICATION DEVELOPMENT LIBRARY (EX. BLAS, ESSL, PARALLEL FUNCTIONS, ETC.); HOST COMPILER; HOST EXECUTION; APPLICATION PROCESSOR INTERFACE; VECTORIZED FUNCTIONAL EMULATOR; (API CODE); PARALLEL OPERATION CONTROL CODE; (PME CODE); CLUSTER SYNCHRONIZER; CLUSTER CONTROLLERS (n); B'CST INTERFACE; PARALLEL I/O I'FACE; PROCESSING ELEMENT ARRAY (n CLUSTERS OF 512 PMEs); PARALLEL PROCESSOR.



FIG.21. Block labels: APPLICATION DEVELOPER; SEQUENTIAL FORTRAN SOURCE OR VECTORIZED HIGH ORDER LANG SOURCE; APPLICATION OPTIMIZER; APPLICATION DEVELOPMENT LIBRARY (EX. BLAS, ESSL, PARALLEL FUNCTIONS, ETC.); HOST COMPILER; HOST EXECUTION; APPLICATION PROCESSOR INTERFACE; VECTORIZED FUNCTIONAL EMULATOR; CUSTOMIZER OR PROD. DEV'R; (API CODE); PARALLEL OPERATION CONTROL CODE; (PE CODE); CLUSTER SYNCHRONIZER; CLUSTER CONTROLLERS; PARALLEL FORTRAN COMPILER SYSTEM; B'CST INTERFACE; PARALLEL I/O I'FACE; PROCESSING ELEMENT ARRAY (n CLUSTERS OF 512 PEs); PARALLEL PROCESSOR.



[Figure. Block labels: CENTRAL TRACK FILE; KALMAN SMOOTHING, TRACK ESTIMATION.]



[Flowchart. Boxes, in order: INPUT; CONSTRUCT THE SPARSE DATA ALLOCATION & SOLVE THE OBVIOUS CASES; CALCULATE INITIAL LAG'N VALUES; REDUCE N-DIM PROBLEM TO N-1 DIM; UPDATE LAG'N VALUES; RECOVER N-DIM SOL'N (NOT F'BLE); TEST IF MAX ON Φ(U) FOUND; SOLUTION. The same loop repeats at lower dimensions: CALCULATE INITIAL LAG'N VALUES; REDUCE 4-DIM PROBLEM TO 3-DIM; UPDATE LAG'N VALUES; RECOVER 4-DIM SOL'N (NOT F'BLE); REDUCE 3-DIM PROBLEM TO 2-DIM; SOLVE 2D ASSIGNMENT (METHODOLOGY: I) MUNKRES, II) JONKER/VOLGENANT); RECOVER 3-DIM SOL'N (NOT F'BLE); TEST IF MAX ON Φ(U) FOUND.]


